Dataset:

DFKI-SLT/tacred

English

Dataset Card for "tacred"

Dataset Summary

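The TAC Relation Extraction Dataset (TACRED) is a large-scale relation extraction dataset with 106,264 examples built over newswire and web text from the corpus used in the yearly TAC Knowledge Base Population (TAC KBP) challenges. Examples in TACRED cover the 41 relation types used in the TAC KBP challenges (e.g., per:schools_attended and org:members), or are labeled no_relation if no defined relation holds. The examples were created by combining the human annotations available from the TAC KBP challenges with crowdsourcing. See Stanford's EMNLP paper, or their EMNLP slides, for full details.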

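Note:

  • An improved, label-corrected version of the TACRED dataset is available, and you should consider using it instead of the original version released in 2017. For details on this version, see the TACRED Revisited paper published at ACL 2020.
  • A relabeled version of the TACRED dataset, Re-TACRED, is also available. For details on this version, see the Re-TACRED paper published at AAAI 2021.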

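This repository provides all three versions of the dataset as BuilderConfigs: 'original', 'revisited', and 're-tacred'. To choose a specific version, set the name parameter of the load_dataset method. The original TACRED is loaded by default.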
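The version selection described above can be sketched as follows, assuming the Hugging Face `datasets` library. The helpers `tacred_load_kwargs` and `load_tacred` are hypothetical names, and the `data_dir` argument reflects an assumption that the loader reads a locally downloaded copy of the LDC release; neither is stated in this card.

```python
# The three BuilderConfig names listed above.
TACRED_VERSIONS = ("original", "revisited", "re-tacred")


def tacred_load_kwargs(version="original", data_dir=None):
    """Build keyword arguments for datasets.load_dataset (hypothetical helper)."""
    if version not in TACRED_VERSIONS:
        raise ValueError(f"unknown TACRED version: {version!r}")
    kwargs = {"path": "DFKI-SLT/tacred", "name": version}
    if data_dir is not None:
        # Assumption: the loader needs a local copy of the LDC files.
        kwargs["data_dir"] = data_dir
    return kwargs


def load_tacred(version="original", data_dir=None):
    """Load one TACRED version; needs the `datasets` package and the data."""
    from datasets import load_dataset  # imported lazily, only when called

    return load_dataset(**tacred_load_kwargs(version, data_dir))
```

Calling `load_tacred("revisited", data_dir=...)` with the directory that holds the downloaded files would then select the revisited version. Whether the call succeeds depends on the installed `datasets` version and on having access to the data.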

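Supported Tasks and Leaderboards

Languages

The language in the dataset is English.

Dataset Structure

Data Instances

  • Size of downloaded dataset files: 62.3 MB
  • Size of the generated dataset: 139.2 MB
  • Total amount of disk used: 201.5 MB

An example of 'train' looks as follows: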

{
  "id": "61b3a5c8c9a882dcfcd2",
  "docid": "AFP_ENG_20070218.0019.LDC2009T13",
  "relation": "org:founded_by",
  "token": ["Tom", "Thabane", "resigned", "in", "October", "last", "year", "to", "form", "the", "All", "Basotho", "Convention", "-LRB-", "ABC", "-RRB-", ",", "crossing", "the", "floor", "with", "17", "members", "of", "parliament", ",", "causing", "constitutional", "monarch", "King", "Letsie", "III", "to", "dissolve", "parliament", "and", "call", "the", "snap", "election", "."],
  "subj_start": 10,
  "subj_end": 13,
  "obj_start": 0,
  "obj_end": 2,
  "subj_type": "ORGANIZATION",
  "obj_type": "PERSON",
  "stanford_pos": ["NNP", "NNP", "VBD", "IN", "NNP", "JJ", "NN", "TO", "VB", "DT", "DT", "NNP", "NNP", "-LRB-", "NNP", "-RRB-", ",", "VBG", "DT", "NN", "IN", "CD", "NNS", "IN", "NN", ",", "VBG", "JJ", "NN", "NNP", "NNP", "NNP", "TO", "VB", "NN", "CC", "VB", "DT", "NN", "NN", "."],
  "stanford_ner": ["PERSON", "PERSON", "O", "O", "DATE", "DATE", "DATE", "O", "O", "O", "O", "O", "O", "O", "ORGANIZATION", "O", "O", "O", "O", "O", "O", "NUMBER", "O", "O", "O", "O", "O", "O", "O", "O", "PERSON", "PERSON", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
  "stanford_head": [2, 3, 0, 5, 3, 7, 3, 9, 3, 13, 13, 13, 9, 15, 13, 15, 3, 3, 20, 18, 23, 23, 18, 25, 23, 3, 3, 32, 32, 32, 32, 27, 34, 27, 34, 34, 34, 40, 40, 37, 3],
  "stanford_deprel": ["compound", "nsubj", "ROOT", "case", "nmod", "amod", "nmod:tmod", "mark", "xcomp", "det", "compound", "compound", "dobj", "punct", "appos", "punct", "punct", "xcomp", "det", "dobj", "case", "nummod", "nmod", "case", "nmod", "punct", "xcomp", "amod", "compound", "compound", "compound", "dobj", "mark", "xcomp", "dobj", "cc", "conj", "det", "compound", "dobj", "punct"]
}

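Data Fields

The data fields are the same among all splits.

  • id: the instance ID of this sentence, a string feature.
  • docid: the TAC KBP document ID of this sentence, a string feature.
  • token: the list of tokens of this sentence, obtained with the StanfordNLP toolkit, a list of string features.
  • relation: the relation label of this instance, a string classification label.
  • subj_start: the 0-based index of the start token of the relation subject mention, an int feature.
  • subj_end: the 0-based index of the end token of the relation subject mention (exclusive), an int feature.
  • subj_type: the NER type of the subject mention, one of the 23 fine-grained types used in the Stanford NER system, a string feature.
  • obj_start: the 0-based index of the start token of the relation object mention, an int feature.
  • obj_end: the 0-based index of the end token of the relation object mention (exclusive), an int feature.
  • obj_type: the NER type of the object mention, one of the 23 fine-grained types used in the Stanford NER system, a string feature.
  • stanford_pos: the part-of-speech tag of each token, a list of string features.
  • stanford_ner: the NER tag of each token (IO scheme), drawn from the 23 fine-grained types used in the Stanford NER system, a list of string features.
  • stanford_deprel: the Stanford dependency relation tag of each token, a list of string features.
  • stanford_head: the 0-based index of the head (source) token for the dependency relation of each token; the root token has a head index of -1. A list of int features.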
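As a small illustration of the span fields, the sketch below recovers the subject and object mentions from the 'train' example shown earlier, treating the end indices as exclusive. The helper name `mention_text` is hypothetical, and the token list is truncated to the first 16 tokens of that example for brevity.

```python
def mention_text(tokens, start, end):
    """Join the tokens of a mention; `end` is exclusive, as described above."""
    return " ".join(tokens[start:end])


# Values copied from the 'train' example (tokens truncated after "-RRB-").
example = {
    "token": ["Tom", "Thabane", "resigned", "in", "October", "last", "year",
              "to", "form", "the", "All", "Basotho", "Convention", "-LRB-",
              "ABC", "-RRB-"],
    "subj_start": 10, "subj_end": 13,
    "obj_start": 0, "obj_end": 2,
    "relation": "org:founded_by",
}

subject = mention_text(example["token"], example["subj_start"], example["subj_end"])
obj = mention_text(example["token"], example["obj_start"], example["obj_end"])
print(f"{subject} --{example['relation']}--> {obj}")
# All Basotho Convention --org:founded_by--> Tom Thabane
```

Because the end index is exclusive, a plain Python slice `tokens[start:end]` selects exactly the mention tokens with no off-by-one adjustment.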

Data Splits

To minimize dataset bias, TACRED is stratified across the years in which the TAC KBP challenge was run:

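| Version   | Train                       | Dev                   | Test                  |
|-----------|-----------------------------|-----------------------|-----------------------|
| TACRED    | 68,124 (TAC KBP 2009-2012)  | 22,631 (TAC KBP 2013) | 15,509 (TAC KBP 2014) |
| Re-TACRED | 58,465 (TAC KBP 2009-2012)  | 19,584 (TAC KBP 2013) | 13,418 (TAC KBP 2014) |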
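The split sizes can be checked with a few lines of arithmetic. The sketch below copies the counts from the table above; the names `SPLITS` and `split_shares` are hypothetical.

```python
# Split sizes copied from the table above.
SPLITS = {
    "TACRED": {"train": 68_124, "dev": 22_631, "test": 15_509},
    "Re-TACRED": {"train": 58_465, "dev": 19_584, "test": 13_418},
}


def split_shares(sizes):
    """Return (total, {split: fraction of total}) for one dataset version."""
    total = sum(sizes.values())
    return total, {name: n / total for name, n in sizes.items()}


for version, sizes in SPLITS.items():
    total, shares = split_shares(sizes)
    summary = ", ".join(f"{name} {share:.1%}" for name, share in shares.items())
    print(f"{version}: {total:,} examples ({summary})")
```

For the original TACRED the three splits sum to 106,264, which matches the total given in the dataset summary.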

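Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

See the Stanford paper and the TACRED Revisited paper, plus their appendices.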

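To ensure that models trained on TACRED are not biased towards predicting false positives on real-world text, all sampled sentences in which no relation was found between the mention pairs were fully annotated as negative examples. As a result, 79.5% of the examples are labeled no_relation.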
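The share of negative examples can be measured on any list of relation labels. The sketch below is a minimal illustration; `negative_share` is a hypothetical helper, and the toy labels are made up for the example rather than taken from the dataset.

```python
from collections import Counter


def negative_share(relations, negative_label="no_relation"):
    """Fraction of labels equal to `negative_label` (0.0 for empty input)."""
    counts = Counter(relations)
    total = sum(counts.values())
    return counts[negative_label] / total if total else 0.0


# Toy labels for illustration only; not taken from the dataset.
toy_relations = ["no_relation", "org:founded_by", "no_relation",
                 "per:schools_attended", "no_relation"]
print(f"{negative_share(toy_relations):.1%}")  # 60.0%
```

On a loaded split, the values of the relation field could be passed in the same way. If the loader encodes relation as class ids rather than strings, they would need to be mapped back to label names first.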

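Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information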

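To respect the copyright of the underlying TAC KBP corpus, TACRED is released via the Linguistic Data Consortium (LDC License). You can download TACRED from the LDC TACRED webpage. Access is free for LDC members; otherwise, an access fee of $25 applies.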

Citation Information

The original dataset:

@inproceedings{zhang2017tacred,
  author = {Zhang, Yuhao and Zhong, Victor and Chen, Danqi and Angeli, Gabor and Manning, Christopher D.},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017)},
  title = {Position-aware Attention and Supervised Data Improve Slot Filling},
  url = {https://nlp.stanford.edu/pubs/zhang2017tacred.pdf},
  pages = {35--45},
  year = {2017}
}

For the revised version ("revisited"), please also cite:

@inproceedings{alt-etal-2020-tacred,
    title = "{TACRED} Revisited: A Thorough Evaluation of the {TACRED} Relation Extraction Task",
    author = "Alt, Christoph  and
      Gabryszak, Aleksandra  and
      Hennig, Leonhard",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.142",
    doi = "10.18653/v1/2020.acl-main.142",
    pages = "1558--1569",
}

For the relabeled version ("re-tacred"), please also cite:

@inproceedings{DBLP:conf/aaai/StoicaPP21,
  author       = {George Stoica and
                  Emmanouil Antonios Platanios and
                  Barnab{\'{a}}s P{\'{o}}czos},
  title        = {Re-TACRED: Addressing Shortcomings of the {TACRED} Dataset},
  booktitle    = {Thirty-Fifth {AAAI} Conference on Artificial Intelligence, {AAAI}
                  2021, Thirty-Third Conference on Innovative Applications of Artificial
                  Intelligence, {IAAI} 2021, The Eleventh Symposium on Educational Advances
                  in Artificial Intelligence, {EAAI} 2021, Virtual Event, February 2-9,
                  2021},
  pages        = {13843--13850},
  publisher    = {{AAAI} Press},
  year         = {2021},
  url          = {https://ojs.aaai.org/index.php/AAAI/article/view/17631},
}

Contributions

Thanks to @dfki-nlp and @phucdev for adding this dataset.