Dataset:

ccdv/govreport-summarization

GovReport dataset for summarization

Dataset for summarization of long documents, adapted from this repo and this paper. It is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable:

"ccdv/govreport-summarization": ("report", "summary")

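The mapping tells the script which column holds the input text and which holds the target summary. The lookup might work roughly as in the sketch below, assuming the mapping is a plain dictionary keyed by dataset name. The helper resolve_columns and its fallback to the first two columns are assumptions made for illustration, not part of Transformers.

```python
# Sketch of how a dataset-name -> column-names mapping can be consulted.
# `resolve_columns` is a hypothetical helper, not part of Transformers.
summarization_name_mapping = {
    "ccdv/govreport-summarization": ("report", "summary"),
}


def resolve_columns(dataset_name, column_names):
    """Return (text_column, summary_column) for a dataset.

    Assumption: fall back to the first two columns when the dataset
    has no entry in the mapping.
    """
    text_column, summary_column = summarization_name_mapping.get(
        dataset_name, (column_names[0], column_names[1])
    )
    for column in (text_column, summary_column):
        if column not in column_names:
            raise ValueError(f"{column!r} not found in columns {column_names}")
    return text_column, summary_column


columns = resolve_columns(
    "ccdv/govreport-summarization", ["id", "report", "summary"]
)
print(columns)  # ('report', 'summary')
```

Under this sketch's fallback, a dataset without a mapping entry would get its first two columns, which is why an explicit entry is useful here.
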
Data Fields

  • id: paper id
  • report: a string containing the body of the report
  • summary: a string containing the summary of the report
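
A record can be pictured as a dictionary with these fields. The sketch below uses an invented placeholder record (not real data) and a hypothetical to_pair helper that extracts the input/target pair. The commented load_dataset call assumes the Hugging Face datasets library is installed and the Hub is reachable.

```python
# Placeholder record with the documented fields; the values are invented
# for illustration and do not come from the dataset.
example = {
    "id": "example-0",
    "report": "Full text of a long government report ...",
    "summary": "Short summary of the report.",
}

REQUIRED_FIELDS = ("report", "summary")


def to_pair(record):
    """Hypothetical helper: check the needed fields, return (source, target)."""
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    return record["report"], record["summary"]


source, target = to_pair(example)
print(len(source.split()), len(target.split()))

# With the `datasets` library installed, real records could be loaded with:
#   from datasets import load_dataset
#   dataset = load_dataset("ccdv/govreport-summarization")
#   source, target = to_pair(dataset["train"][0])
```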

Data Splits

This dataset has 3 splits: train, validation, and test. Token counts were computed with a RoBERTa tokenizer.

Dataset Split    Number of Instances    Avg. tokens (report / summary)
Train            17,517                 < 9,000 / < 500
Validation       973                    < 9,000 / < 500
Test             973                    < 9,000 / < 500
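
Token averages like these depend on the tokenizer used. Such an average could be computed as in the sketch below, for any callable that maps a string to a list of tokens. The whitespace tokenizer is a stand-in for illustration only and will give different counts than RoBERTa. The helper average_token_count is hypothetical, and the roberta-base checkpoint named in the comment is an assumption, since the card does not say which RoBERTa tokenizer was used.

```python
def average_token_count(texts, tokenize=str.split):
    """Mean number of tokens per text under the given tokenizer callable."""
    texts = list(texts)
    if not texts:
        raise ValueError("need at least one text")
    return sum(len(tokenize(text)) for text in texts) / len(texts)


# Toy inputs, invented for illustration.
reports = ["a b c d", "e f"]
print(average_token_count(reports))  # 3.0

# A RoBERTa tokenizer could be plugged in instead (checkpoint is an assumption):
#   from transformers import AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("roberta-base")
#   average_token_count(reports, tokenize=tokenizer.tokenize)
```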

Cite original article

@misc{huang2021efficient,
      title={Efficient Attentions for Long Document Summarization},
      author={Luyang Huang and Shuyang Cao and Nikolaus Parulian and Heng Ji and Lu Wang},
      year={2021},
      eprint={2104.02112},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}