Comparative analysis of summarization models on a variety of everyday documents
Dataset host/upload for SummComparer. This is just a hosting page; check the repo for the latest info.
Outside of a basic EDA Colab notebook, there are some static sites powered by pandas-profiling.
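For reference, reports like those can be generated in a few lines. A minimal sketch, assuming a merged DataFrame; the file paths and title below are placeholders, not the exact commands used for the hosted pages:

```python
import pandas as pd
from pandas_profiling import ProfileReport  # now published as ydata-profiling

# Load the merged dataset into a DataFrame (path is a placeholder).
df = pd.read_parquet("summcomparer_gauntlet.parquet")

# Build an HTML profile of every column and write it out as a static page.
profile = ProfileReport(df, title="SummComparer EDA")
profile.to_file("summcomparer_profile.html")
```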
**Note:** The current version of the dataset is still largely in a "raw" format. It has seen some basic cleaning but may need more in the future.
In the repo, the dataset is split into two tables: one contains the original documents (the long text, IDs, etc.), and the other contains everything else.
If you are joining the two yourself, join on `source_doc_id` (a sketch is included after the loading example below). Here, they have already been merged for you. You can load and use the dataset from here:
```python
from datasets import load_dataset

dataset = load_dataset("pszemraj/summcomparer-gauntlet-v0p1")
dataset
```
which should output (for v0.1.2):
```
DatasetDict({
    train: Dataset({
        features: ["GAUNTLET_PATH", "file_name", "summary", "min_length",
                   "max_length", "no_repeat_ngram_size", "encoder_no_repeat_ngram_size",
                   "repetition_penalty", "num_beams", "num_beam_groups",
                   "length_penalty", "early_stopping", "do_sample", "model_name",
                   "date", "length", "format", "extractiveness", "temperature",
                   "token_batch_length", "penalty_alpha", "top_k", "batch_stride",
                   "max_len_ratio", "directory-topic-tag", "runtime",
                   "source_doc_filename", "source_doc_id", "source_doc_domain",
                   "document_text"],
        num_rows: 2043
    })
})
```
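If you want to reproduce that merge from the repo's split tables yourself, here is a minimal sketch using pandas. The file names below are assumptions for illustration; check the repo for the actual paths:

```python
import pandas as pd

# The repo splits the dataset into two tables (file names here are hypothetical):
# one holds the long document text plus IDs, the other holds everything else.
docs = pd.read_csv("documents.csv")       # source_doc_id, document_text, ...
summaries = pd.read_csv("summaries.csv")  # source_doc_id, summary, params, ...

# Join on the shared key to recover the merged layout hosted here.
merged = summaries.merge(docs, on="source_doc_id", how="left")
```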
This dataset does contain reference summaries generated by GPT-4 and GPT-3.5-turbo. While this shouldn't be an issue, since the dataset is meant for analysis rather than training, please note that the OpenAI-generated text is subject to OpenAI's terms of use. That data can be filtered out or dropped if needed for your use case.
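As a minimal sketch with the `datasets` API (the exact `model_name` values to match are assumptions; inspect the column to confirm before filtering):

```python
# Assumed substrings identifying OpenAI models in the model_name column.
openai_models = ("gpt-4", "gpt-3.5")

# Keep only rows whose summary was not generated by an OpenAI model;
# `dataset` is the DatasetDict loaded above.
filtered = dataset["train"].filter(
    lambda row: not any(m in (row["model_name"] or "").lower() for m in openai_models)
)
```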