Datasets:

arxiv_dataset

Tasks:

translation

summarization

text-retrieval

Sub-tasks:

document-retrieval entity-linking-retrieval explanation-generation

Languages:

Multilinguality:

monolingual

Size:

1M<n<10M

Language creators:

expert-generated

Annotation creators:

no-annotation

Source datasets:

original

ArXiv:

arxiv:1905.00075

License:

cc0-1.0


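Dataset Card for arXiv Dataset

Dataset Summary

A dataset of 1.7 million arXiv articles for applications such as trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction, and semantic search interfaces.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The supported language is English.

Dataset Structure

Data Instances

This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1 TB and growing), this dataset provides only a metadata file in JSON format. An example record: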

{'id': '0704.0002',
 'submitter': 'Louis Theran',
 'authors': 'Ileana Streinu and Louis Theran',
 'title': 'Sparsity-certifying Graph Decompositions',
 'comments': 'To appear in Graphs and Combinatorics',
 'journal-ref': None,
 'doi': None,
 'report-no': None,
 'categories': 'math.CO cs.CG',
 'license': 'http://arxiv.org/licenses/nonexclusive-distrib/1.0/',
 'abstract': '  We describe a new algorithm, the $(k,\\ell)$-pebble game with colors, and use\nit obtain a characterization of the family of $(k,\\ell)$-sparse graphs and\nalgorithmic solutions to a family of problems concerning tree decompositions of\ngraphs. Special instances of sparse graphs appear in rigidity theory and have\nreceived increased attention in recent years. In particular, our colored\npebbles generalize and strengthen the previous results of Lee and Streinu and\ngive a new proof of the Tutte-Nash-Williams characterization of arboricity. We\nalso present a new decomposition that certifies sparsity based on the\n$(k,\\ell)$-pebble game with colors. Our work also exposes connections between\npebble game algorithms and previous sparse graph algorithms by Gabow, Gabow and\nWestermann and Hendrickson.\n',
 'update_date': '2008-12-13'}
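
As a minimal sketch of how such records might be read, the snippet below assumes the metadata file stores one JSON object per line (JSON Lines). That layout, the helper name `iter_records`, and the placeholder path in the comment are assumptions for illustration, not something this card specifies. Streaming line by line avoids loading the whole file into memory.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one metadata record per non-empty line of JSON text.

    Assumes a JSON Lines layout (one object per line); this is an
    assumption about the file, not a documented guarantee.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical usage with a real file; the path is a placeholder:
# with open("metadata.json", encoding="utf-8") as f:
#     for record in iter_records(f):
#         print(record["id"], record["title"])

# Small in-memory demonstration using fields from the example above.
sample = [
    '{"id": "0704.0002", "title": "Sparsity-certifying Graph Decompositions", "doi": null}',
    "",
]
records = list(iter_records(sample))
```

JSON `null` values arrive as Python `None`, which matches the `None` entries shown in the example record.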

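Data Fields

id: ArXiv ID (can be used to access the paper)
submitter: the person who submitted the paper
authors: the authors of the paper
title: the title of the paper
comments: additional information, such as the number of pages and figures
journal-ref: information about the journal in which the paper was published
doi: Digital Object Identifier
report-no: report number
abstract: the abstract of the paper
categories: categories/tags in the ArXiv system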
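
The fields above can be lightly normalized before use. The sketch below is illustrative only: `normalize_record` and the added `abs_url` key are hypothetical names, and the assumption is that `categories` is a space-separated string, as in the example record. The abstract page URL is built from `id` using the usual `https://arxiv.org/abs/<id>` pattern.

```python
def normalize_record(record: dict) -> dict:
    """Return a copy of a metadata record with two convenience changes.

    - 'categories' becomes a list (assumes space-separated tags).
    - 'abs_url' (a hypothetical extra key) points at the abstract page.
    """
    out = dict(record)
    out["categories"] = (record.get("categories") or "").split()
    out["abs_url"] = "https://arxiv.org/abs/" + record["id"]
    return out


example = {
    "id": "0704.0002",
    "title": "Sparsity-certifying Graph Decompositions",
    "categories": "math.CO cs.CG",
    "doi": None,
}
normalized = normalize_record(example)
```

Optional fields such as `doi`, `journal-ref`, and `report-no` may be `None`, so downstream code should check for that before using them.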

Data Splits

The data is not split.
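
Since no splits are provided, users who need them have to create their own. One possible approach, sketched here under assumptions, is to assign each record to a split by hashing its `id`, so the assignment is reproducible without storing a random seed. The function name `assign_split` and the default 10% test fraction are illustrative choices, not part of the dataset.

```python
import hashlib


def assign_split(arxiv_id: str, test_fraction: float = 0.1) -> str:
    """Deterministically map an ArXiv ID to 'train' or 'test'.

    The ID is hashed with SHA-256 and reduced to a bucket in [0, 10000);
    IDs whose bucket falls below the threshold go to the test split.
    """
    digest = hashlib.sha256(arxiv_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    threshold = int(test_fraction * 10_000)
    return "test" if bucket < threshold else "train"


split = assign_split("0704.0002")
```

Because the result depends only on the ID, a paper stays in the same split even when the metadata file grows.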

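Dataset Creation

Curation Rationale

For 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science, and everything in between, including mathematics, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming, depth. In these times of unique global challenges, efficient extraction of insights from data is essential. To help make arXiv more accessible, a free, open, machine-readable version of the Kaggle arXiv dataset is provided here: a repository of 1.7 million articles with relevant features such as article titles, authors, categories, abstracts, full-text PDFs, and more. The aim is to empower new use cases that can lead to the exploration of richer machine learning techniques combining multi-modal features, for applications such as trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction, and semantic search interfaces.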

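Source Data

The data is based on arXiv papers. [More Information Needed]

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

This dataset contains no annotations.

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]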

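Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

The original data is maintained by ArXiv.

Licensing Information

The data is released under the Creative Commons CC0 1.0 Universal Public Domain Dedication.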

Citation Information

@misc{clement2019arxiv,
    title={On the Use of ArXiv as a Dataset},
    author={Colin B. Clement and Matthew Bierbaum and Kevin P. O'Keeffe and Alexander A. Alemi},
    year={2019},
    eprint={1905.00075},
    archivePrefix={arXiv},
    primaryClass={cs.IR}
}

Contributions

Thanks to @tanmoyio for adding this dataset.

Author:

Anonymous

Dataset size:

14.58 KB