Comparative analysis of summarization models on a variety of everyday documents
Dataset host/upload for SummComparer. This is just a hosting page; check the repo for the latest info.
Outside of a basic EDA Colab notebook, there are some static sites powered by pandas-profiling.
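For reference, reports like those can be generated in a few lines. A minimal sketch, assuming a merged DataFrame; the file paths and title below are placeholders, not the exact commands used for the hosted pages:

```python
import pandas as pd
from pandas_profiling import ProfileReport  # now published as ydata-profiling

# Load the merged dataset into a DataFrame (path is a placeholder).
df = pd.read_parquet("summcomparer_gauntlet.parquet")

# Build an HTML profile of every column and write it out as a static page.
profile = ProfileReport(df, title="SummComparer EDA")
profile.to_file("summcomparer_profile.html")
```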
**Note:** The current version of the dataset is still largely in a "raw" format. It has seen some basic cleaning but may need more in the future.
In the repo, the dataset is split into two tables: one contains the original documents (the long text, IDs, etc.), and the other contains everything else.
If you are joining the two yourself, join on `source_doc_id` (a sketch is included after the loading example below). Here, they have already been merged for you. You can load and use the dataset from here:
```python
from datasets import load_dataset

dataset = load_dataset("pszemraj/summcomparer-gauntlet-v0p1")
dataset
```
which should output (for v0.1.2):
```
DatasetDict({
    train: Dataset({
        features: ["GAUNTLET_PATH", "file_name", "summary", "min_length",
                   "max_length", "no_repeat_ngram_size", "encoder_no_repeat_ngram_size",
                   "repetition_penalty", "num_beams", "num_beam_groups",
                   "length_penalty", "early_stopping", "do_sample", "model_name",
                   "date", "length", "format", "extractiveness", "temperature",
                   "token_batch_length", "penalty_alpha", "top_k", "batch_stride",
                   "max_len_ratio", "directory-topic-tag", "runtime",
                   "source_doc_filename", "source_doc_id", "source_doc_domain",
                   "document_text"],
        num_rows: 2043
    })
})
```
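If you want to reproduce that merge from the repo's split tables yourself, here is a minimal sketch using pandas. The file names below are assumptions for illustration; check the repo for the actual paths:

```python
import pandas as pd

# The repo splits the dataset into two tables (file names here are hypothetical):
# one holds the long document text plus IDs, the other holds everything else.
docs = pd.read_csv("documents.csv")       # source_doc_id, document_text, ...
summaries = pd.read_csv("summaries.csv")  # source_doc_id, summary, params, ...

# Join on the shared key to recover the merged layout hosted here.
merged = summaries.merge(docs, on="source_doc_id", how="left")
```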
This dataset does contain reference summaries generated by GPT-4 and GPT-3.5-turbo. While this shouldn't be an issue, since the dataset is meant for analysis rather than training, please note that the OpenAI-generated text is subject to OpenAI's terms of use. That data can be filtered out or dropped if needed for your use case.
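As a minimal sketch with the `datasets` API (the exact `model_name` values to match are assumptions; inspect the column to confirm before filtering):

```python
# Assumed substrings identifying OpenAI models in the model_name column.
openai_models = ("gpt-4", "gpt-3.5")

# Keep only rows whose summary was not generated by an OpenAI model;
# `dataset` is the DatasetDict loaded above.
filtered = dataset["train"].filter(
    lambda row: not any(m in (row["model_name"] or "").lower() for m in openai_models)
)
```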