Datasets:

scientific_papers

Tasks:

Summarization

Languages:

Multilinguality:

monolingual

Size:

100K<n<1M

Language creators:

found

Annotations creators:

found

Source datasets:

original

ArXiv:

arxiv:1804.05685

Other:

abstractive-summarization

License:

license:unknown

Dataset Card for "scientific_papers"

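Dataset Summary

The scientific papers dataset contains two sets of long, structured documents, obtained from the ArXiv and PubMed OpenAccess repositories.

Both "arxiv" and "pubmed" have three features:

article: the body of the document, with paragraphs separated by "\n".
abstract: the abstract of the document, with paragraphs separated by "\n".
section_names: the section titles, separated by "\n".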
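
Since all three features are newline-delimited strings, a first processing step is often to split them into lists. The following is a minimal sketch under that assumption. The helper name `split_record` and the sample record are hypothetical and only mirror the field layout described above; a real record would come from the dataset itself, for example through the Hugging Face `datasets` library.

```python
from typing import Dict, List


def split_record(record: Dict[str, str]) -> Dict[str, List[str]]:
    """Split each newline-delimited field of one record into a list."""
    fields = ("article", "abstract", "section_names")
    return {name: record[name].split("\n") for name in fields}


# Hypothetical record shaped like the description above (not real data).
sample = {
    "article": "first paragraph .\nsecond paragraph .",
    "abstract": "a one paragraph abstract .",
    "section_names": "introduction\nmethod\nsummary",
}

parts = split_record(sample)
print(len(parts["article"]), "paragraphs;", parts["section_names"])
```

Keeping the raw strings in the dataset and splitting on demand leaves the stored format unchanged, which matters if downstream code expects a single string per field.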

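Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

arxiv

Size of downloaded dataset files: 4.50 GB
Size of the generated dataset: 7.58 GB
Total amount of disk used: 12.09 GB

An example of 'train' looks as follows.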

This example was too long and was cropped:

{
    "abstract": "\" we have studied the leptonic decay @xmath0 , via the decay channel @xmath1 , using a sample of tagged @xmath2 decays collected...",
    "article": "\"the leptonic decays of a charged pseudoscalar meson @xmath7 are processes of the type @xmath8 , where @xmath9 , @xmath10 , or @...",
    "section_names": "[sec:introduction]introduction\n[sec:detector]data and the cleo- detector\n[sec:analysys]analysis method\n[sec:conclusion]summary"
}
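
In the arxiv example above, each entry of section_names appears to carry a bracketed label such as "[sec:introduction]" in front of the title. The sketch below separates label and title, assuming that "[label]title" pattern holds; the card does not document it, so entries without a label are passed through unchanged. The function name `parse_section_names` is hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Assumed pattern: an optional "[label]" prefix followed by the section title.
_LABELLED = re.compile(r"^\[(?P<label>[^\]]*)\](?P<title>.*)$")


def parse_section_names(section_names: str) -> List[Tuple[Optional[str], str]]:
    """Return (label, title) pairs; label is None when no prefix is found."""
    pairs = []
    for line in section_names.split("\n"):
        match = _LABELLED.match(line)
        if match:
            pairs.append((match.group("label"), match.group("title").strip()))
        else:
            pairs.append((None, line.strip()))
    return pairs


# The section_names value from the cropped 'train' example above.
example = (
    "[sec:introduction]introduction\n"
    "[sec:detector]data and the cleo- detector\n"
    "[sec:analysys]analysis method\n"
    "[sec:conclusion]summary"
)
sections = parse_section_names(example)
for label, title in sections:
    print(label, "->", title)
```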

pubmed

Size of downloaded dataset files: 4.50 GB
Size of the generated dataset: 2.51 GB
Total amount of disk used: 7.01 GB

An example of 'validation' looks as follows.

This example was too long and was cropped:

{
    "abstract": "\" background and aim : there is lack of substantial indian data on venous thromboembolism ( vte ) . \\n the aim of this study was...",
    "article": "\"approximately , one - third of patients with symptomatic vte manifests pe , whereas two - thirds manifest dvt alone .\\nboth dvt...",
    "section_names": "\"Introduction\\nSubjects and Methods\\nResults\\nDemographics and characteristics of venous thromboembolism patients\\nRisk factors ..."
}

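Data Fields

The data fields are the same among all splits.

arxiv

article: a string feature.
abstract: a string feature.
section_names: a string feature.

pubmed

article: a string feature.
abstract: a string feature.
section_names: a string feature.

Data Splits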

name	train	validation	test
arxiv	203037	6436	6440
pubmed	119924	6633	6658
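
To put the split sizes in proportion, the short sketch below totals each configuration and computes the share of every split. The counts are copied from the table above; the helper name `split_shares` is hypothetical.

```python
from typing import Dict, Tuple

# Split sizes copied from the table above.
SPLITS = {
    "arxiv": {"train": 203037, "validation": 6436, "test": 6440},
    "pubmed": {"train": 119924, "validation": 6633, "test": 6658},
}


def split_shares(counts: Dict[str, int]) -> Tuple[int, Dict[str, float]]:
    """Return the total document count and each split's fraction of it."""
    total = sum(counts.values())
    return total, {name: n / total for name, n in counts.items()}


for config, counts in SPLITS.items():
    total, shares = split_shares(counts)
    summary = ", ".join(f"{name} {share:.1%}" for name, share in shares.items())
    print(f"{config}: {total} documents ({summary})")
```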

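Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information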

@article{Cohan_2018,
   title={A Discourse-Aware Attention Model for Abstractive Summarization of
            Long Documents},
   url={http://dx.doi.org/10.18653/v1/n18-2097},
   DOI={10.18653/v1/n18-2097},
   journal={Proceedings of the 2018 Conference of the North American Chapter of
          the Association for Computational Linguistics: Human Language
          Technologies, Volume 2 (Short Papers)},
   publisher={Association for Computational Linguistics},
   author={Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli},
   year={2018}
}

Contributions

Thanks to @thomwolf, @jplu, @lewtun, and @patrickvonplaten for adding this dataset.

Author:

Anonymous

Dataset size:

19.32 KB
