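Dataset: docred
Tasks: Text Retrieval
Languages: en
Multilinguality: monolingual
Size Categories: 100K<n<1M
Language Creators: crowdsourced
Annotations Creators: expert-generated
Source Datasets: original
ArXiv: arxiv:1906.06127
License: mit

Multiple entities in a document generally exhibit complex inter-sentence relations. Existing relation extraction (RE) methods typically focus on intra-sentence relations for a single entity pair and cannot handle this case well. To advance research on document-level RE, we introduce DocRED, a dataset constructed from Wikipedia and Wikidata with three features:

- DocRED annotates both named entities and relations, and is the largest human-annotated dataset for document-level RE from plain text.
- DocRED requires reading multiple sentences in a document to extract entities and infer their relations by synthesizing all information in the document.
- Along with the human-annotated data, we also offer large-scale distantly supervised data, which lets DocRED be used in both supervised and weakly supervised scenarios.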
An example of 'train_annotated' looks as follows.
{
  "labels": {
    "evidence": [[0]],
    "head": [0],
    "relation_id": ["P1"],
    "relation_text": ["is_a"],
    "tail": [0]
  },
  "sents": [
    ["This", "is", "a", "sentence"],
    ["This", "is", "another", "sentence"]
  ],
  "title": "Title of the document",
  "vertexSet": [
    [
      { "name": "sentence", "pos": [3], "sent_id": 0, "type": "NN" },
      { "name": "sentence", "pos": [3], "sent_id": 1, "type": "NN" }
    ],
    [
      { "name": "This", "pos": [0], "sent_id": 0, "type": "NN" }
    ]
  ]
}
The data fields are the same among all splits.
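
The 'train_annotated' example stores relations as parallel lists of indices, which can be hard to read at a glance. The following is a minimal sketch of how such a record might be turned into readable facts. It assumes that `head` and `tail` are positions in `vertexSet`, that each `vertexSet` entry lists the mentions of one entity, and that `evidence` holds sentence indices into `sents`. The helper name `resolve_relations` is hypothetical and not part of the dataset or any library.

```python
# Toy record copied from the 'train_annotated' example.
example = {
    "labels": {
        "evidence": [[0]],
        "head": [0],
        "relation_id": ["P1"],
        "relation_text": ["is_a"],
        "tail": [0],
    },
    "sents": [
        ["This", "is", "a", "sentence"],
        ["This", "is", "another", "sentence"],
    ],
    "title": "Title of the document",
    "vertexSet": [
        [
            {"name": "sentence", "pos": [3], "sent_id": 0, "type": "NN"},
            {"name": "sentence", "pos": [3], "sent_id": 1, "type": "NN"},
        ],
        [
            {"name": "This", "pos": [0], "sent_id": 0, "type": "NN"},
        ],
    ],
}


def resolve_relations(record):
    """Hypothetical helper: map index-based labels to readable facts."""
    labels = record["labels"]
    facts = []
    # The label fields are parallel lists, one position per relation.
    for head, tail, rel_id, rel_text, evidence in zip(
        labels["head"],
        labels["tail"],
        labels["relation_id"],
        labels["relation_text"],
        labels["evidence"],
    ):
        # Assumption: head/tail index into vertexSet; the first mention
        # is used as the display name of the entity.
        head_name = record["vertexSet"][head][0]["name"]
        tail_name = record["vertexSet"][tail][0]["name"]
        # Assumption: evidence entries are sentence indices into "sents".
        evidence_text = [" ".join(record["sents"][i]) for i in evidence]
        facts.append(
            {
                "head": head_name,
                "relation_id": rel_id,
                "relation_text": rel_text,
                "tail": tail_name,
                "evidence": evidence_text,
            }
        )
    return facts


facts = resolve_relations(example)
for fact in facts:
    print(fact)
```

In this toy record both `head` and `tail` are 0, so the single relation links the first entity to itself; real records would normally link distinct entities.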
default

name | train_annotated | train_distant | validation | test |
---|---|---|---|---|
default | 3053 | 101873 | 998 | 1000 |
@inproceedings{yao-etal-2019-docred,
    title = "{D}oc{RED}: A Large-Scale Document-Level Relation Extraction Dataset",
    author = "Yao, Yuan and Ye, Deming and Li, Peng and Han, Xu and Lin, Yankai and Liu, Zhenghao and Liu, Zhiyuan and Huang, Lixin and Zhou, Jie and Sun, Maosong",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P19-1074",
    doi = "10.18653/v1/P19-1074",
    pages = "764--777",
}
Thanks to @ghomasHudson, @thomwolf, @lhoestq for adding this dataset.