Dataset:
ccdv/arxiv-summarization
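A dataset for long-document summarization, adapted from this repo. Note that the original data is pre-tokenized, so this dataset returns " ".join(text) and adds "\n" between paragraphs. The dataset is compatible with the run_summarization.py script from the Transformers library if you add the following line to the summarization_name_mapping variable: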
"ccdv/arxiv-summarization": ("article", "abstract")
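As a rough illustration of the two points above, the following Python sketch shows how the mapping entry resolves the text and summary columns, and how pre-tokenized paragraphs could be joined into a single string. The local dict, the helper `join_pretokenized`, and the sample tokens are assumptions made for this example; they are not taken from the dataset loader or from run_summarization.py, whose real mapping contains entries for other datasets as well.

```python
# Hypothetical local stand-in for the dict of the same name in
# run_summarization.py; only the entry from this card is included.
summarization_name_mapping = {
    "ccdv/arxiv-summarization": ("article", "abstract"),
}


def join_pretokenized(paragraphs):
    """Join tokens with spaces and paragraphs with newlines.

    `paragraphs` is assumed to be a list of token lists, one per paragraph.
    """
    return "\n".join(" ".join(tokens) for tokens in paragraphs)


# Look up which columns hold the document and its summary.
text_column, summary_column = summarization_name_mapping["ccdv/arxiv-summarization"]

# Made-up tokens, only to show the shape of a resulting record.
example = {
    text_column: join_pretokenized(
        [["we", "study", "x", "."], ["results", "follow", "."]]
    ),
    summary_column: join_pretokenized([["a", "short", "abstract", "."]]),
}

print(repr(example[text_column]))
```

Because tokens are separated by single spaces, punctuation stays detached from words (for example "x ." rather than "x."), which is worth keeping in mind when feeding the text to a subword tokenizer.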
The dataset has three splits: train, validation, and test. Token counts are based on whitespace.
Dataset Split | Number of Instances | Avg. tokens (article / abstract) |
---|---|---|
Train | 203,037 | 6038 / 299 |
Validation | 6,436 | 5894 / 172 |
Test | 6,440 | 5905 / 174 |
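A whitespace-based token count like the one used for the averages in the table can be sketched in a few lines of Python. This is a minimal sketch that assumes tokens are delimited by any run of whitespace, newlines included; the exact procedure behind the reported figures is not stated here, and the function names are hypothetical.

```python
def whitespace_token_count(text):
    """Count tokens by splitting on any run of whitespace."""
    return len(text.split())


def average_token_count(texts):
    """Mean whitespace token count over an iterable of strings."""
    counts = [whitespace_token_count(t) for t in texts]
    if not counts:
        return 0.0
    return sum(counts) / len(counts)


# Made-up strings standing in for documents.
docs = ["we study x .\nresults follow .", "a short abstract ."]
print(average_token_count(docs))
```

Applying `average_token_count` separately to the document column and the summary column of a split would give a pair of numbers comparable to one row of the table, under the stated assumption.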
@inproceedings{cohan-etal-2018-discourse,
    title = "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents",
    author = "Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli",
    booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)",
    month = jun,
    year = "2018",
    address = "New Orleans, Louisiana",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/N18-2097",
    doi = "10.18653/v1/N18-2097",
    pages = "615--621",
    abstract = "Neural abstractive summarization models have led to promising results in summarizing relatively short documents. We propose the first model for abstractive summarization of single, longer-form documents (e.g., research papers). Our approach consists of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary. Empirical results on two large-scale datasets of scientific papers show that our model significantly outperforms state-of-the-art models.",
}