Dataset:
ccdv/govreport-summarization
A dataset for summarization of long documents, adapted from this repo and this paper. It is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable:
"ccdv/govreport-summarization": ("report", "summary")
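As a rough illustration of what that line does, the sketch below shows a trimmed-down mapping and a lookup in the spirit of the script. The helper `resolve_columns` is hypothetical and is not part of Transformers; only two of the script's stock entries are reproduced, and the fallback to the first two columns is a simplification.

```python
# Simplified stand-in for the dictionary in run_summarization.py.
# Each value is a (text_column, summary_column) pair.
summarization_name_mapping = {
    "cnn_dailymail": ("article", "highlights"),
    "xsum": ("document", "summary"),
    # The line added for this dataset:
    "ccdv/govreport-summarization": ("report", "summary"),
}


def resolve_columns(dataset_name, column_names):
    """Hypothetical helper: pick the text and summary columns for a dataset.

    Falls back to the first two columns when the dataset has no entry
    in the mapping (an assumption made for this sketch).
    """
    mapped = summarization_name_mapping.get(dataset_name)
    if mapped is None:
        return column_names[0], column_names[1]
    return mapped


text_column, summary_column = resolve_columns(
    "ccdv/govreport-summarization", ["report", "summary"]
)
print(text_column, summary_column)
```

Without the added entry, the lookup would fall back to column order, which happens to work here but is fragile if the columns are ever reordered.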
This dataset has 3 splits: train, validation, and test. Token counts are computed with a RoBERTa tokenizer.
Dataset Split | Number of Instances | Avg. tokens (report / summary) |
---|---|---|
Train | 17,517 | < 9,000 / < 500 |
Validation | 973 | < 9,000 / < 500 |
Test | 973 | < 9,000 / < 500 |
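Averages like those in the table could be reproduced along the following lines. This is a minimal sketch, not the procedure the dataset authors used: `average_token_counts` is a hypothetical helper, it accepts any tokenizing callable, and the toy records and whitespace tokenizer stand in for the real data and a RoBERTa tokenizer.

```python
from statistics import mean


def average_token_counts(records, tokenize):
    """Return (avg report tokens, avg summary tokens) for an iterable of records.

    `records` must yield dicts with "report" and "summary" keys;
    `tokenize` is any callable mapping a string to a list of tokens.
    """
    records = list(records)
    report_lengths = [len(tokenize(r["report"])) for r in records]
    summary_lengths = [len(tokenize(r["summary"])) for r in records]
    return mean(report_lengths), mean(summary_lengths)


# Toy records and a whitespace tokenizer, used only to keep the sketch offline.
toy_records = [
    {"report": "a b c d", "summary": "a b"},
    {"report": "a b c d e f", "summary": "a b c d"},
]
avg_report, avg_summary = average_token_counts(toy_records, str.split)
print(avg_report, avg_summary)
```

To run this on the real data, one would presumably pass a split obtained from `datasets.load_dataset("ccdv/govreport-summarization")` as `records` and the `tokenize` method of a RoBERTa tokenizer (for example one loaded with `AutoTokenizer.from_pretrained("roberta-base")`) as `tokenize`. The exact checkpoint behind the table is not stated in the source.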
@misc{huang2021efficient,
  title={Efficient Attentions for Long Document Summarization},
  author={Luyang Huang and Shuyang Cao and Nikolaus Parulian and Heng Ji and Lu Wang},
  year={2021},
  eprint={2104.02112},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}