MediaSum dataset for summarization

Summarization dataset copied from MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization

This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable:

"ccdv/mediasum": ("document", "summary")
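
For orientation, the edit might look like the sketch below. The first two entries are an abbreviated excerpt shown only for context, on the assumption that your copy of the script may list different or additional datasets. Only the last entry comes from this README.

```python
# Abbreviated sketch of the mapping in run_summarization.py.
# Assumption: your copy of the script may list different or more datasets.
summarization_name_mapping = {
    "cnn_dailymail": ("article", "highlights"),
    "xsum": ("document", "summary"),
    # Entry added for this dataset: (text column, summary column).
    "ccdv/mediasum": ("document", "summary"),
}

# The column names are then looked up by dataset name.
text_column, summary_column = summarization_name_mapping["ccdv/mediasum"]
print(text_column, summary_column)
```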

Configs

4 possible configs:

roberta will concatenate documents with "</s>"
newline will concatenate documents with "\n"
bert will concatenate documents with "[SEP]"
list will return the list of documents instead of a single string

Append _prepended to a config name to prepend the speaker name to each utterance, in the form speaker: text. The default config is roberta_prepended (compatible with BART).
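
To make the config names concrete, here is a minimal sketch of how one transcript could be assembled under each config. It is not the dataset's actual loading script: the function build_document, its arguments, and the exact spacing around separators are hypothetical and chosen only for illustration.

```python
# Separator used by each string-producing config, as listed above.
SEPARATORS = {"roberta": "</s>", "newline": "\n", "bert": "[SEP]"}
SUFFIX = "_prepended"


def build_document(speakers, utterances, config="roberta_prepended"):
    """Hypothetical helper: join utterances the way a config name describes."""
    prepend = config.endswith(SUFFIX)
    base = config[: -len(SUFFIX)] if prepend else config

    if prepend:
        # "speaker: text" form described for the *_prepended configs.
        turns = [f"{s}: {u}" for s, u in zip(speakers, utterances)]
    else:
        turns = list(utterances)

    if base == "list":
        # The list config keeps the turns separate.
        return turns
    if base not in SEPARATORS:
        raise ValueError(f"unknown config: {config!r}")
    # Assumption: the separator is inserted with no surrounding whitespace.
    return SEPARATORS[base].join(turns)


speakers = ["HOST", "GUEST"]
utterances = ["Welcome back.", "Thanks for having me."]
print(build_document(speakers, utterances, "roberta_prepended"))
print(build_document(speakers, utterances, "list"))
```

The list variants return one string per turn, which is convenient when a model needs its own separator or per-turn truncation.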

Data Fields

id : transcript id
document : a string (or a list of strings, for the list configs) containing the interview transcript
summary : a string containing the summary of the transcript
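
The fields can be inspected with a sketch like the one below. It assumes the datasets library is installed, that the Hub repository ccdv/mediasum is reachable, and that every base config also exists with the _prepended suffix. Depending on your datasets version, script-based datasets may need extra arguments or may not be supported. The helpers load_mediasum and preview are hypothetical names introduced for this example.

```python
BASE_CONFIGS = ("roberta", "newline", "bert", "list")
# Assumption: each base config also exists with the "_prepended" suffix.
VALID_CONFIGS = BASE_CONFIGS + tuple(f"{c}_prepended" for c in BASE_CONFIGS)


def load_mediasum(config="roberta_prepended", split="validation"):
    """Hypothetical wrapper: validate the config name, then load one split."""
    if config not in VALID_CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {VALID_CONFIGS}")
    # Imported lazily so the validation above works without the library.
    from datasets import load_dataset

    return load_dataset("ccdv/mediasum", config, split=split)


def preview(example, max_chars=80):
    """Format one record using the three fields described above."""
    doc = example["document"]
    if isinstance(doc, list):
        # The list configs return one string per turn.
        doc = " | ".join(doc)
    return f'{example["id"]}: {doc[:max_chars]} -> {example["summary"]}'
```

With network access, preview(load_mediasum()[0]) would format the first validation example.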

Data Splits

This dataset has 3 splits: train, validation, and test.

Dataset Split	Number of Instances
Train	443596
Validation	10000
Test	10000

Cite original article

@article{zhu2021mediasum,
  title={MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization},
  author={Zhu, Chenguang and Liu, Yang and Mei, Jie and Zeng, Michael},
  journal={arXiv preprint arXiv:2103.06410},
  year={2021}
}

Author:

ccdv

Dataset size:

1.41 GB