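Dataset: allenai/scifact
Tasks: text-classification
Sub-tasks: fact-checking
Languages: en
Multilinguality: monolingual
Size: 1K<n<10K
Language creators: found
Annotation creators: expert-generated
Source datasets: original
License: cc-by-nc-2.0

SciFact is a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts, annotated with labels and rationales.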
An example of 'validation' looks as follows.
{ "cited_doc_ids": [14717500], "claim": "1,000 genomes project enables mapping of genetic sequence variation consisting of rare variants with larger penetrance effects than common variants.", "evidence_doc_id": "14717500", "evidence_label": "SUPPORT", "evidence_sentences": [2, 5], "id": 3 }

corpus
An example of 'train' looks as follows.
This example was too long and was cropped: { "abstract": "[\"Alterations of the architecture of cerebral white matter in the developing human brain can affect cortical development and res...", "doc_id": 4983, "structured": false, "title": "Microstructural development of human newborn cerebral white matter assessed in vivo by diffusion tensor magnetic resonance imaging." }
The data fields are the same among all splits.
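The two record types above are linked through `evidence_doc_id` (a string in the claim example) and `doc_id` (an integer in the corpus example). The following is a minimal sketch of resolving a claim's evidence sentences, not the dataset authors' code. It assumes that `abstract` is a list of sentence strings, that `evidence_sentences` holds 0-based indices into that list, and that a claim without evidence carries an empty `evidence_doc_id`. The helper name `evidence_text` and the placeholder corpus record are hypothetical.

```python
from typing import Any, Dict, List

# Claim record copied from the 'validation' example above.
claim: Dict[str, Any] = {
    "cited_doc_ids": [14717500],
    "claim": (
        "1,000 genomes project enables mapping of genetic sequence variation "
        "consisting of rare variants with larger penetrance effects than "
        "common variants."
    ),
    "evidence_doc_id": "14717500",
    "evidence_label": "SUPPORT",
    "evidence_sentences": [2, 5],
    "id": 3,
}

# Hypothetical corpus record: the sentences are placeholders, not the real abstract.
corpus_record: Dict[str, Any] = {
    "doc_id": 14717500,
    "title": "placeholder title",
    "abstract": [f"sentence {i}" for i in range(7)],
    "structured": False,
}


def evidence_text(
    claim: Dict[str, Any], corpus_by_id: Dict[int, Dict[str, Any]]
) -> List[str]:
    """Return the abstract sentences cited as evidence for a claim.

    Assumes 0-based sentence indices and an empty `evidence_doc_id`
    for claims that have no evidence document.
    """
    if not claim["evidence_doc_id"]:
        return []
    # The claim stores the document id as a string; the corpus uses an integer.
    doc = corpus_by_id[int(claim["evidence_doc_id"])]
    return [doc["abstract"][i] for i in claim["evidence_sentences"]]


corpus_by_id = {corpus_record["doc_id"]: corpus_record}
selected = evidence_text(claim, corpus_by_id)
```

Indexing the corpus by `doc_id` once keeps each lookup constant-time, which matters if the same helper is applied to every claim.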
| | train | validation | test |
---|---|---|---|
claims | 1261 | 450 | 300 |
| | train |
---|---|
corpus | 5183 |
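As a rough sketch of how the split sizes listed above might be checked after download, the snippet below wraps `load_dataset` from the Hugging Face `datasets` library. The configuration names `claims` and `corpus` are taken from the table row labels and are an assumption here, as is the availability of the dataset under those names in the installed `datasets` version. The helpers `load_scifact` and `size_matches` are hypothetical names introduced for illustration.

```python
from typing import Dict, Tuple

# Row counts copied from the split tables above.
EXPECTED_ROWS: Dict[Tuple[str, str], int] = {
    ("claims", "train"): 1261,
    ("claims", "validation"): 450,
    ("claims", "test"): 300,
    ("corpus", "train"): 5183,
}


def load_scifact(config: str, split: str):
    """Load one configuration and split of allenai/scifact.

    Requires network access and the `datasets` package; the config
    names are assumed from the table row labels.
    """
    from datasets import load_dataset  # imported lazily so the module loads offline

    return load_dataset("allenai/scifact", config, split=split)


def size_matches(config: str, split: str, num_rows: int) -> bool:
    """Compare an observed row count with the count listed in the tables."""
    return EXPECTED_ROWS[(config, split)] == num_rows


# Example usage (not executed here because it needs network access):
#   ds = load_scifact("claims", "validation")
#   print(size_matches("claims", "validation", len(ds)))
```

Keeping the expected counts in one dictionary makes a mismatch after a dataset update easy to spot.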
https://github.com/allenai/scifact/blob/master/LICENSE.md
The SciFact dataset is released under the CC BY-NC 2.0 license. By using the SciFact data, you agree to its usage terms.
@inproceedings{wadden-etal-2020-fact,
    title = "Fact or Fiction: Verifying Scientific Claims",
    author = "Wadden, David and Lin, Shanchuan and Lo, Kyle and Wang, Lucy Lu and van Zuylen, Madeleine and Cohan, Arman and Hajishirzi, Hannaneh",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.emnlp-main.609",
    doi = "10.18653/v1/2020.emnlp-main.609",
    pages = "7534--7550",
}
Thanks to @thomwolf, @lhoestq, @dwadden, @patrickvonplaten, @mariamabarham, and @lewtun for adding this dataset.