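Dataset: gfissore/arxiv-abstracts-2021
Language: en
Multilinguality: monolingual
Size: 1M<n<10M
Language creators: expert-generated
Annotation creators: no-annotation
Preprint: arxiv:1905.00075
License: cc0-1.0

This is a metadata dataset covering all arXiv articles up to the end of 2021 (about 2 million papers), including titles and abstracts. Possible applications include trend analysis, paper recommendation engines, category prediction, knowledge graph construction, and semantic search interfaces.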
In contrast to arxiv_dataset, this dataset does not include papers submitted to arXiv after 2021, and it does not require any external download.
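Because no external download is required, the data can presumably be pulled straight from the Hugging Face Hub. The following is a minimal sketch, assuming the `datasets` library is installed and that the Hub exposes the data under a single default split. The helper name `load_arxiv_abstracts` is hypothetical and not part of the dataset.

```python
DATASET_ID = "gfissore/arxiv-abstracts-2021"  # repository name taken from this card


def load_arxiv_abstracts():
    """Fetch the dataset from the Hugging Face Hub and return its only split."""
    # Imported lazily so that defining this helper does not require `datasets`.
    from datasets import load_dataset

    dataset_dict = load_dataset(DATASET_ID)
    # The card lists no predefined splits. Assumption: the Hub still wraps the
    # data in one default split, so we take whichever split name comes first.
    first_split = next(iter(dataset_dict))
    return dataset_dict[first_split]
```

Calling `load_arxiv_abstracts()` downloads the data on first use and caches it locally. Each row should then be a dictionary with the fields shown in the example instance on this card.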
[More Information Needed]
English
This is an example instance:
{ "id": "1706.03762", "submitter": "Ashish Vaswani", "authors": "Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion\n Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin", "title": "Attention Is All You Need", "comments": "15 pages, 5 figures", "journal-ref": null, "doi": null, "abstract": " The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\ntranslation task, our model establishes a new single-model state-of-the-art\nBLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction\nof the training costs of the best models from the literature. We show that the\nTransformer generalizes well to other tasks by applying it successfully to\nEnglish constituency parsing both with large and limited training data.\n", "report-no": null, "categories": [ "cs.CL cs.LG" ], "versions": [ "v1", "v2", "v3", "v4", "v5" ] }
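The example instance shows two quirks that matter downstream: text fields such as `authors` and `abstract` keep the hard line breaks of the arXiv source, and `categories` is a list holding a single space-separated string. A minimal sketch of one way to normalize a record follows. The function name `normalize_record` is hypothetical, and the abstract is shortened here for brevity.

```python
import re


def normalize_record(record):
    """Return a copy of a metadata record with cleaned text and split categories."""

    def collapse(text):
        # Collapse every run of whitespace (including "\n") into one space.
        if text is None:
            return None
        return re.sub(r"\s+", " ", text).strip()

    cleaned = dict(record)
    for field in ("title", "authors", "abstract"):
        cleaned[field] = collapse(record.get(field))
    # "categories" holds space-separated strings, e.g. ["cs.CL cs.LG"].
    cleaned["categories"] = [
        category
        for entry in record.get("categories") or []
        for category in entry.split()
    ]
    return cleaned


example = {
    "id": "1706.03762",
    "authors": "Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion\n Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin",
    "title": "Attention Is All You Need",
    "abstract": "  The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration.\n",
    "categories": ["cs.CL cs.LG"],
    "versions": ["v1", "v2", "v3", "v4", "v5"],
}

normalized = normalize_record(example)
print(normalized["categories"])  # ['cs.CL', 'cs.LG']
```

Splitting the categories up front makes tasks such as category prediction simpler, since each label can then be treated as a separate target.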
These fields are described in detail on arXiv.
No splits
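Since the dataset ships without predefined splits, anyone who needs held-out data has to create their own. One option is sketched below, assuming a deterministic hash of the `id` field decides the assignment. The function `assign_split` and its `test_fraction` parameter are hypothetical and not part of the dataset.

```python
import hashlib


def assign_split(arxiv_id, test_fraction=0.1):
    """Deterministically map an arXiv id to "train" or "test"."""
    # SHA-256 is stable across runs and machines, unlike Python's built-in
    # hash(), which is salted per process for strings.
    digest = hashlib.sha256(arxiv_id.encode("utf-8")).digest()
    # Compare the first 8 bytes, read as an integer in [0, 2**64), against
    # an integer threshold so no floating-point rounding is involved.
    bucket = int.from_bytes(digest[:8], "big")
    threshold = int(test_fraction * 2**64)
    return "test" if bucket < threshold else "train"


print(assign_split("1706.03762"))
```

Hashing the identifier keeps a paper in the same split even if the dataset is reshuffled or reloaded, which a random split with an unrecorded seed would not guarantee.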
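For more than 30 years, arXiv has served the public and the research community by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science, as well as other fields such as mathematics, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant depth, but it can also be overwhelming. In the current era of global challenges, extracting insights from data efficiently is essential. The arxiv-abstracts-2021 dataset aims to make arXiv more accessible for machine learning applications by providing key metadata (including titles and abstracts) for about 2 million papers.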
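Initial data collection and normalization: [More Information Needed]
Who are the source language producers? The language producers are members of the scientific community at large, but they are not necessarily affiliated with any institution.
[N/A]
Who are the annotators? [N/A]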
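The dataset contains the full names of the papers' authors.
[More Information Needed]
[More Information Needed]
[More Information Needed]
The original data is maintained by arXiv.
The data is licensed under the Creative Commons CC0 1.0 Universal Public Domain Dedication.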
@misc{clement2019arxiv, title={On the Use of ArXiv as a Dataset}, author={Colin B. Clement and Matthew Bierbaum and Kevin P. O'Keeffe and Alexander A. Alemi}, year={2019}, eprint={1905.00075}, archivePrefix={arXiv}, primaryClass={cs.IR} }