PubMed Dataset for Summarization

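A dataset for summarization of long documents, adapted from this repo. Note that the original data is pre-tokenized, so this dataset returns " ".join(text) and adds "\n" between paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add the following line to its summarization_name_mapping variable: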

"ccdv/pubmed-summarization": ("article", "abstract")

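As a rough illustration of the two points above, the sketch below shows how pre-tokenized paragraphs could be flattened into one string, and how a column mapping entry is looked up. The function name join_pretokenized, the sample paragraphs, and the assumption that each paragraph is a list of token strings are hypothetical. The dictionary is a local stand-in, not the actual variable inside run_summarization.py.

```python
# Hypothetical sketch: this is not the dataset's actual loading code.


def join_pretokenized(paragraphs):
    """Join tokens with spaces and paragraphs with newlines.

    Assumes each paragraph is a list of already-tokenized strings.
    """
    return "\n".join(" ".join(tokens) for tokens in paragraphs)


# Local stand-in for the mapping described above:
# dataset name -> (text column, summary column).
summarization_name_mapping = {
    "ccdv/pubmed-summarization": ("article", "abstract"),
}

# Made-up sample input, for illustration only.
sample = [["a", "study", "of", "x", "."], ["results", "follow", "."]]
article_text = join_pretokenized(sample)

# Look up which columns hold the document and its summary.
text_column, summary_column = summarization_name_mapping[
    "ccdv/pubmed-summarization"
]
```

Because tokens are joined with single spaces, punctuation stays separated from the preceding word, which is worth keeping in mind when comparing generated summaries against the reference abstracts.
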
Data Fields

id: the article ID
article: a string containing the body of the paper
abstract: a string containing the abstract of the paper
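
A minimal sketch of checking that a record carries the three fields listed above. The helper missing_fields and the example record are hypothetical and only illustrate the expected shape; they are not part of the dataset or of any library.

```python
# Field names taken from the list above.
EXPECTED_FIELDS = ("id", "article", "abstract")


def missing_fields(record):
    """Return the expected field names that are absent from a record (a dict)."""
    return [name for name in EXPECTED_FIELDS if name not in record]


# Made-up record for illustration only; not real data from the dataset.
example = {"id": "0", "article": "body text ...", "abstract": "summary text ..."}
problems = missing_fields(example)
```

With the Hugging Face datasets library installed, records of this shape would typically be obtained through load_dataset("ccdv/pubmed-summarization"). That call needs network access, so it is left out of the sketch.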

Data Splits

This dataset has 3 splits: train, validation, and test. Token counts are whitespace-based.

Dataset Split	Number of Instances	Avg. tokens (article / abstract)
Train	119,924	3043 / 215
Validation	6,633	3111 / 216
Test	6,658	3092 / 219
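
A whitespace-based token count of the kind reported in the table can be sketched as follows. The function names are hypothetical, and the sketch assumes that "whitespace-based" means splitting on any run of whitespace, newlines included. It is not the script that produced the figures above.

```python
def whitespace_token_count(text):
    """Count tokens by splitting on runs of whitespace (spaces, tabs, newlines)."""
    return len(text.split())


def average_token_count(texts):
    """Average whitespace token count over a collection of strings."""
    texts = list(texts)
    if not texts:
        return 0.0
    return sum(whitespace_token_count(t) for t in texts) / len(texts)


# Made-up strings for illustration only.
average = average_token_count(["a b", "c d e f"])
```

Applying average_token_count separately to the article and abstract columns of a split would give a pair of averages in the same form as the table.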

Cite Original Article

@inproceedings{cohan-etal-2018-discourse,
  title = "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents",
  author = "Cohan, Arman  and
    Dernoncourt, Franck  and
    Kim, Doo Soon  and
    Bui, Trung  and
    Kim, Seokhwan  and
    Chang, Walter  and
    Goharian, Nazli",
  booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)",
  month = jun,
  year = "2018",
  address = "New Orleans, Louisiana",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/N18-2097",
  doi = "10.18653/v1/N18-2097",
  pages = "615--621",
  abstract = "Neural abstractive summarization models have led to promising results in summarizing relatively short documents. We propose the first model for abstractive summarization of single, longer-form documents (e.g., research papers). Our approach consists of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary. Empirical results on two large-scale datasets of scientific papers show that our model significantly outperforms state-of-the-art models.",
}

Author:

ccdv

Dataset size:

826.89 MB