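Dataset: allenai/cord19
Preprint: arxiv:2004.07180
Source datasets: original
Annotation creators: no-annotation
Language creators: found
Size: 100K<n<1M
Multilinguality: monolingual
Language: en

CORD-19 is a corpus of scholarly papers on COVID-19 and related coronavirus research. It is curated and maintained by the Semantic Scholar team at the Allen Institute for AI to support text mining and natural language processing research.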
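See the tasks defined in the related Kaggle challenge.
The dataset is in English (en).
The following block gives an overview of one sample in JSON-like syntax (abbreviated, because some fields are very long):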
{
  "abstract": "OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified [...]",
  "authors": "Madani, Tariq A; Al-Ghamdi, Aisha A",
  "cord_uid": "ug7v899j",
  "doc_embeddings": [
    -2.939983606338501,
    -6.312200546264648,
    -1.0459030866622925,
    [...] 766 values in total [...]
    -4.107113361358643,
    -3.8174145221710205,
    1.8976187705993652,
    5.811529159545898,
    -2.9323840141296387
  ],
  "doi": "10.1186/1471-2334-1-6",
  "journal": "BMC Infect Dis",
  "publish_time": "2001-07-04",
  "sha": "d1aafb70c066a2068b02786f8929fd9c900897fb",
  "source_x": "PMC",
  "title": "Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia",
  "url": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC35282/"
}
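Several fields in the sample are plain strings that usually need light parsing before analysis. The following is a minimal sketch, not part of the dataset tooling: it assumes that `authors` is semicolon-separated and that `publish_time` is a full ISO date, as in the sample shown. The helper names `split_authors` and `parse_publish_time` are hypothetical, and other records may not follow these formats.

```python
from datetime import date
from typing import List, Optional

# Record abbreviated from the sample; only the fields used here are kept.
record = {
    "cord_uid": "ug7v899j",
    "authors": "Madani, Tariq A; Al-Ghamdi, Aisha A",
    "publish_time": "2001-07-04",
    "doi": "10.1186/1471-2334-1-6",
}


def split_authors(authors: str) -> List[str]:
    """Split an author string on ';' (assumed separator) and trim whitespace."""
    return [name.strip() for name in authors.split(";") if name.strip()]


def parse_publish_time(value: str) -> Optional[date]:
    """Parse a YYYY-MM-DD string; return None if it is not a full ISO date."""
    try:
        return date.fromisoformat(value)
    except ValueError:
        return None


authors = split_authors(record["authors"])
published = parse_publish_time(record["publish_time"])
print(authors)
print(published)
```

Returning `None` for unparseable dates keeps the sketch tolerant of records whose `publish_time` is missing or less precise than the sample's.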
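Currently only the following fields are integrated: cord_uid, sha, source_x, title, doi, abstract, publish_time, authors, journal. With the fulltext configuration, the sections transcribed in pdf_json_files are converted into the fulltext feature.
Additional fields are available depending on the selected loading configuration (metadata, fulltext, or embeddings).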
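As a rough sketch of how a configuration might be selected, the snippet below wraps `load_dataset` from the Hugging Face `datasets` library. The wrapper name `load_cord19` and the `CONFIGS` tuple are hypothetical, introduced only for illustration; the configuration names are taken from the size table in this card. Depending on the installed `datasets` version, loading a script-based dataset such as this one may need extra arguments or may not be supported, so treat this as an assumption to verify rather than a guaranteed recipe.

```python
# Configuration names as listed in this card's size table.
CONFIGS = ("metadata", "fulltext", "embeddings")


def load_cord19(config: str = "metadata"):
    """Load the train split of allenai/cord19 for one configuration.

    Hypothetical convenience wrapper; validation happens before the
    (potentially large) download is triggered.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    # Imported lazily so the validation above works without the library installed.
    from datasets import load_dataset

    return load_dataset("allenai/cord19", config, split="train")


# Example usage (downloads data, so it is left commented out):
# ds = load_cord19("embeddings")
# print(ds[0]["cord_uid"], len(ds[0]["doc_embeddings"]))
```

The default of `metadata` is chosen here because it carries only the integrated fields listed above; `fulltext` and `embeddings` add their extra fields on top.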
The dataset provides no annotations, so all instances are placed in the training split.
The size of each configuration is as follows:
| Configuration | train |
|---|---|
| metadata | 368618 |
| fulltext | 368618 |
| embeddings | 368618 |
See the official readme.
Initial data collection and normalization
See the official readme.
Who are the source language producers? See the official readme.
No annotations.
Annotation process: N/A
Who are the annotators? N/A
[More Information Needed]
@article{Wang2020CORD19TC,
  title={CORD-19: The Covid-19 Open Research Dataset},
  author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier},
  journal={ArXiv},
  year={2020}
}
Thanks to @ggdupont for adding this dataset.