Dataset:

gfissore/arxiv-abstracts-2021

Tasks:

Summarization

Text Retrieval

Text2Text Generation

Sub-tasks:

explanation-generation text-simplification document-retrieval

Languages:

English

Multilinguality:

monolingual

Size:

1M<n<10M

Language creators:

expert-generated

Annotations creators:

no-annotation

ArXiv:

arxiv:1905.00075

License:

cc0-1.0


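Dataset Card for arxiv-abstracts-2021

Dataset Summary

A dataset of metadata, including titles and abstracts, for all arXiv articles up to the end of 2021 (about 2 million papers). Possible applications include trend analysis, paper recommender engines, category prediction, knowledge graph construction, and semantic search interfaces.

In contrast to arxiv_dataset, this dataset does not include papers submitted to arXiv after 2021, and it does not require any external download.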
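
Because no external download is needed, the records can be pulled directly from the Hugging Face Hub. The following is a minimal sketch, assuming the `datasets` library is installed and that the data is exposed under a default `train` split (the card lists no splits, so the split name is an assumption). The helper name `load_arxiv_abstracts` is hypothetical.

```python
DATASET_ID = "gfissore/arxiv-abstracts-2021"


def load_arxiv_abstracts(streaming: bool = True):
    """Open the dataset from the Hugging Face Hub.

    With streaming=True the records are read lazily instead of being
    downloaded in full before first use.
    """
    # Imported lazily so that defining this helper does not itself
    # require the `datasets` package.
    from datasets import load_dataset

    # Assumption: the single unsplit table is served as the "train" split.
    return load_dataset(DATASET_ID, split="train", streaming=streaming)
```

Under these assumptions, `next(iter(load_arxiv_abstracts()))` should yield one record as a Python dict.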

Supported Tasks and Leaderboards

[More Information Needed]

Languages

English

Dataset Structure

Data Instances

Here is an example instance:

{
  "id": "1706.03762",
  "submitter": "Ashish Vaswani",
  "authors": "Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion\n  Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin",
  "title": "Attention Is All You Need",
  "comments": "15 pages, 5 figures",
  "journal-ref": null,
  "doi": null,
  "abstract": "  The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\ntranslation task, our model establishes a new single-model state-of-the-art\nBLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction\nof the training costs of the best models from the literature. We show that the\nTransformer generalizes well to other tasks by applying it successfully to\nEnglish constituency parsing both with large and limited training data.\n",
  "report-no": null,
  "categories": [
    "cs.CL cs.LG"
  ],
  "versions": [
    "v1",
    "v2",
    "v3",
    "v4",
    "v5"
  ]
}
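
The instance above keeps arXiv's original line wrapping: `authors` and `abstract` contain embedded `\n` characters and padding spaces, and `categories` holds a single space-separated string inside a list. A minimal cleanup sketch follows. The helper names (`collapse_whitespace`, `split_categories`, `normalize_record`) are hypothetical, and it assumes every record is shaped like this example.

```python
import re
from typing import Any, Dict, List, Optional

_WHITESPACE = re.compile(r"\s+")


def collapse_whitespace(text: Optional[str]) -> Optional[str]:
    """Replace runs of whitespace (including newlines) with one space."""
    if text is None:
        return None
    return _WHITESPACE.sub(" ", text).strip()


def split_categories(categories: List[str]) -> List[str]:
    """Turn ["cs.CL cs.LG"] into ["cs.CL", "cs.LG"]."""
    return [tag for entry in categories for tag in entry.split()]


def normalize_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record with cleaned text fields and split tags."""
    cleaned = dict(record)
    for field in ("title", "authors", "abstract"):
        cleaned[field] = collapse_whitespace(record.get(field))
    cleaned["categories"] = split_categories(record.get("categories") or [])
    return cleaned
```

Applied to the example, `authors` would read `... Jakob Uszkoreit, Llion Jones, ...` on a single line and `categories` would become a two-element list.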

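Data Fields

These fields are described in detail on arXiv:

id: arXiv ID (can be used to access the paper)
submitter: the person who submitted the paper
authors: the authors of the paper
title: the title of the paper
comments: additional information, such as the number of pages and figures
journal-ref: information about the journal in which the paper was published
doi: Digital Object Identifier
report-no: report number
abstract: the abstract of the paper
categories: categories/tags in the arXiv system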
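
The field list can be written down as a typed record, and the `id` field can be turned into a link to the paper. This is a sketch under assumptions: the Python types and the nullability of each field are inferred from the single example instance, not from an official schema, and the names `ArxivRecord` and `abs_url` are hypothetical. Two keys contain a hyphen (`journal-ref`, `report-no`), so the functional `TypedDict` syntax is used.

```python
from typing import List, Optional, TypedDict

# Assumed schema, inferred from the example instance on this card.
ArxivRecord = TypedDict(
    "ArxivRecord",
    {
        "id": str,
        "submitter": Optional[str],
        "authors": Optional[str],
        "title": Optional[str],
        "comments": Optional[str],
        "journal-ref": Optional[str],
        "doi": Optional[str],
        "report-no": Optional[str],
        "abstract": Optional[str],
        "categories": List[str],
    },
)


def abs_url(record: ArxivRecord) -> str:
    """Build the arXiv abstract-page URL from the record's id."""
    return f"https://arxiv.org/abs/{record['id']}"
```

For the example instance, `abs_url` gives `https://arxiv.org/abs/1706.03762`. Hyphenated fields must be read with subscript access, for example `record["journal-ref"]`.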

Data Splits

No splits
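
Since the data ships unsplit, anyone training a model has to define their own partition. One possible approach, which this card does not prescribe, is to hash each arXiv `id` so that a paper always lands in the same subset regardless of load order. The function name `assign_split` and the 10% default are illustrative assumptions.

```python
import hashlib


def assign_split(arxiv_id: str, test_fraction: float = 0.1) -> str:
    """Deterministically map an arXiv id to "train" or "test"."""
    if not 0.0 <= test_fraction <= 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    digest = hashlib.sha256(arxiv_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Integer threshold avoids floating-point rounding at the boundaries.
    threshold = int(test_fraction * 2**64)
    return "test" if bucket < threshold else "train"
```

The assignment depends only on the id, so it stays stable across runs and machines.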

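Dataset Creation

Curation Rationale

For over 30 years, arXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science, and everything in between, including mathematics, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming, depth. In these times of unique global challenges, efficient extraction of insights from data is essential. The arxiv-abstracts-2021 dataset aims to make arXiv more accessible for machine learning applications by providing important metadata (including titles and abstracts) for about 2 million papers.

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

The language producers are members of the scientific community at large, not necessarily affiliated with any institution.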

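Annotations

Annotation process

[N/A]

Who are the annotators?

[N/A]

Personal and Sensitive Information

The full names of the papers' authors are included in the dataset.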

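Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]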

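Additional Information

Dataset Curators

The original data is maintained by ArXiv.

Licensing Information

The data is released under the Creative Commons CC0 1.0 Universal Public Domain Dedication.

Citation Information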

@misc{clement2019arxiv,
    title={On the Use of ArXiv as a Dataset},
    author={Colin B. Clement and Matthew Bierbaum and Kevin P. O'Keeffe and Alexander A. Alemi},
    year={2019},
    eprint={1905.00075},
    archivePrefix={arXiv},
    primaryClass={cs.IR}
}

Author:

gfissore

Dataset size:

896.38 MB