Dataset: scientific_papers
Task: summarization
Language: en
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: found
Annotation creators: found
Source datasets: original
Preprint: arxiv:1804.05685
License: unknown

The Scientific Papers dataset contains two sets of long and structured documents, obtained from the ArXiv and PubMed OpenAccess repositories.
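For orientation, loading either subset with the Hugging Face `datasets` library might look like the following sketch (`load_dataset` and the "arxiv"/"pubmed" config names come from the library and this card; the rest is illustrative):

```python
from datasets import load_dataset

# Load the ArXiv configuration; pass "pubmed" instead for the PubMed portion.
# Depending on your `datasets` version, script-based datasets like this one
# may also require trust_remote_code=True.
dataset = load_dataset("scientific_papers", "arxiv")

print(dataset)  # shows the train/validation/test splits
```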
"arxiv"和"pubmed"都有两个特征:
'train'的一个示例如下所示。
This example was too long and was cropped: { "abstract": "\" we have studied the leptonic decay @xmath0 , via the decay channel @xmath1 , using a sample of tagged @xmath2 decays collected...", "article": "\"the leptonic decays of a charged pseudoscalar meson @xmath7 are processes of the type @xmath8 , where @xmath9 , @xmath10 , or @...", "section_names": "[sec:introduction]introduction\n[sec:detector]data and the cleo- detector\n[sec:analysys]analysis method\n[sec:conclusion]summary" }

pubmed
An example of 'validation' looks as follows.
This example was too long and was cropped: { "abstract": "\" background and aim : there is lack of substantial indian data on venous thromboembolism ( vte ) . \\n the aim of this study was...", "article": "\"approximately , one - third of patients with symptomatic vte manifests pe , whereas two - thirds manifest dvt alone .\\nboth dvt...", "section_names": "\"Introduction\\nSubjects and Methods\\nResults\\nDemographics and characteristics of venous thromboembolism patients\\nRisk factors ..." }
The data fields are the same among all splits: article, abstract, and section_names (all string features).
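A minimal sketch of inspecting these fields on one record (the field names are taken from the examples above; the slicing is purely illustrative):

```python
from datasets import load_dataset

pubmed = load_dataset("scientific_papers", "pubmed")
example = pubmed["validation"][0]

print(example["abstract"][:200])   # target summary text
print(example["article"][:200])    # full article body
print(example["section_names"])    # newline-separated section titles
```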
name | train | validation | test |
---|---|---|---|
arxiv | 203037 | 6436 | 6440 |
pubmed | 119924 | 6633 | 6658 |
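To double-check these counts locally, a sketch like the following should print numbers matching the table above:

```python
from datasets import load_dataset

# Iterate over both configurations and report the size of each split.
for config in ("arxiv", "pubmed"):
    ds = load_dataset("scientific_papers", config)
    print(config, {split: ds[split].num_rows for split in ds})
```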
@article{Cohan_2018,
  title={A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents},
  url={http://dx.doi.org/10.18653/v1/n18-2097},
  DOI={10.18653/v1/n18-2097},
  journal={Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)},
  publisher={Association for Computational Linguistics},
  author={Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli},
  year={2018}
}
Thanks to @thomwolf, @jplu, @lewtun and @patrickvonplaten for adding this dataset.