Dataset:
ccdv/mediasum
Summarization dataset copied from MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization
This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable:
"ccdv/mediasum": ("document", "summary")
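To illustrate what the mapping entry does, here is a minimal sketch of a name-to-columns lookup. The dictionary entry is the one given above; `resolve_columns` is a hypothetical helper written for this example and only approximates how a script might fall back to the mapped column names when none are passed explicitly.

```python
# Dataset name -> (text column, summary column), as in the line above.
summarization_name_mapping = {
    "ccdv/mediasum": ("document", "summary"),
}


def resolve_columns(dataset_name, text_column=None, summary_column=None):
    """Hypothetical helper: choose the text and summary column names.

    Explicit arguments win; otherwise the defaults registered for the
    dataset in summarization_name_mapping are used.
    """
    default_text, default_summary = summarization_name_mapping.get(
        dataset_name, (None, None)
    )
    return (text_column or default_text, summary_column or default_summary)


text_col, summary_col = resolve_columns("ccdv/mediasum")
print(text_col, summary_col)
```

With the entry registered, the column names do not need to be supplied by hand for this dataset.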
There are 4 possible configs. Add _prepended to a config name to prepend the speaker name to each dialogue turn, in the form speaker: text. The default is roberta_prepended (compatible with BART).
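The speaker: text format can be sketched as follows. This is an illustration only, not the dataset's actual preprocessing code: the function name, the parallel-list input, and the newline separator between turns are all assumptions made for the example.

```python
def prepend_speakers(speakers, utterances, separator="\n"):
    """Hypothetical helper: render each turn as 'speaker: text'.

    `speakers` and `utterances` are parallel lists; `separator` joins
    the rendered turns (a newline here, which is an assumption).
    """
    if len(speakers) != len(utterances):
        raise ValueError("speakers and utterances must have the same length")
    return separator.join(
        f"{speaker}: {text}" for speaker, text in zip(speakers, utterances)
    )


dialogue = prepend_speakers(
    ["HOST", "GUEST"],
    ["Welcome to the show.", "Thanks for having me."],
)
print(dialogue)
```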
This dataset has 3 splits: train, validation, and test.
Dataset Split | Number of Instances |
---|---|
Train | 443596 |
Validation | 10000 |
Test | 10000 |
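As a sanity check after loading, the split sizes in the table can be compared against what was actually loaded. The sketch below is a hypothetical helper, not part of the dataset or of any library; it accepts any mapping from split name to a sized object, so it can be tried without downloading anything.

```python
# Split sizes taken from the table above.
EXPECTED_SPLIT_SIZES = {"train": 443596, "validation": 10000, "test": 10000}


def check_split_sizes(splits):
    """Hypothetical helper: return {split: (expected, actual)} for mismatches.

    `splits` maps split names to objects supporting len(); a missing
    split is reported with an actual size of None.
    """
    mismatches = {}
    for name, expected in EXPECTED_SPLIT_SIZES.items():
        actual = len(splits[name]) if name in splits else None
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches


# Stand-in data with the right sizes; an empty result means all sizes match.
fake_splits = {name: range(size) for name, size in EXPECTED_SPLIT_SIZES.items()}
print(check_split_sizes(fake_splits))
```

Assuming the Hugging Face datasets library is installed, the object returned by `load_dataset("ccdv/mediasum", "roberta_prepended")` should be usable as the `splits` argument, since it behaves like a mapping from split name to dataset.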
@article{zhu2021mediasum,
  title={MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization},
  author={Zhu, Chenguang and Liu, Yang and Mei, Jie and Zeng, Michael},
  journal={arXiv preprint arXiv:2103.06410},
  year={2021}
}