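Dataset: DFKI-SLT/tacred
Tasks: text-classification
Languages: en
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: found
Source datasets: extended|other
ArXiv: arxiv:2104.08398
License: other

The TAC Relation Extraction Dataset (TACRED) is a large-scale relation extraction dataset with 106,264 examples, built over newswire and web text from the corpus used in the yearly TAC Knowledge Base Population (TAC KBP) challenges. Examples in TACRED cover the 41 relation types used in the TAC KBP challenges (e.g., per:schools_attended and org:members), or are labeled no_relation if no defined relation holds. The examples were created by combining the available human annotations from the TAC KBP challenges with crowdsourcing. See Stanford's EMNLP paper or their EMNLP slides for full details.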
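Note:
This repository provides all three versions of the dataset as BuilderConfigs: 'original', 'revisited', and 're-tacred'. Set the name parameter of the load_dataset method to choose a specific version. The original TACRED is loaded by default.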
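A minimal sketch of selecting a version, assuming the Hugging Face `datasets` library is installed. The helpers `tacred_load_kwargs` and `load_tacred` are hypothetical names introduced here for illustration. Passing `data_dir` is an assumption: since the data is distributed through the LDC, the loader likely needs a local copy, and the exact requirements may depend on the installed `datasets` version.

```python
# Config names taken from the note above; "original" is the documented default.
TACRED_VERSIONS = ("original", "revisited", "re-tacred")


def tacred_load_kwargs(version="original", data_dir=None):
    """Build keyword arguments for datasets.load_dataset (hypothetical helper)."""
    if version not in TACRED_VERSIONS:
        raise ValueError(f"unknown TACRED version: {version!r}")
    kwargs = {"path": "DFKI-SLT/tacred", "name": version}
    if data_dir is not None:
        # Assumption: the loader reads a local copy of the LDC release.
        kwargs["data_dir"] = data_dir
    return kwargs


def load_tacred(version="original", data_dir=None):
    """Load one TACRED version; requires the `datasets` package."""
    # Imported lazily so tacred_load_kwargs works without the library installed.
    from datasets import load_dataset

    return load_dataset(**tacred_load_kwargs(version, data_dir))
```

For example, `load_tacred("re-tacred", data_dir="path/to/tacred")` would request the relabeled version; the path shown is a placeholder, not a real location.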
The language in the dataset is English.
An example from 'train' looks as follows:
{ "id": "61b3a5c8c9a882dcfcd2", "docid": "AFP_ENG_20070218.0019.LDC2009T13", "relation": "org:founded_by", "token": ["Tom", "Thabane", "resigned", "in", "October", "last", "year", "to", "form", "the", "All", "Basotho", "Convention", "-LRB-", "ABC", "-RRB-", ",", "crossing", "the", "floor", "with", "17", "members", "of", "parliament", ",", "causing", "constitutional", "monarch", "King", "Letsie", "III", "to", "dissolve", "parliament", "and", "call", "the", "snap", "election", "."], "subj_start": 10, "subj_end": 13, "obj_start": 0, "obj_end": 2, "subj_type": "ORGANIZATION", "obj_type": "PERSON", "stanford_pos": ["NNP", "NNP", "VBD", "IN", "NNP", "JJ", "NN", "TO", "VB", "DT", "DT", "NNP", "NNP", "-LRB-", "NNP", "-RRB-", ",", "VBG", "DT", "NN", "IN", "CD", "NNS", "IN", "NN", ",", "VBG", "JJ", "NN", "NNP", "NNP", "NNP", "TO", "VB", "NN", "CC", "VB", "DT", "NN", "NN", "."], "stanford_ner": ["PERSON", "PERSON", "O", "O", "DATE", "DATE", "DATE", "O", "O", "O", "O", "O", "O", "O", "ORGANIZATION", "O", "O", "O", "O", "O", "O", "NUMBER", "O", "O", "O", "O", "O", "O", "O", "O", "PERSON", "PERSON", "O", "O", "O", "O", "O", "O", "O", "O", "O"], "stanford_head": [2, 3, 0, 5, 3, 7, 3, 9, 3, 13, 13, 13, 9, 15, 13, 15, 3, 3, 20, 18, 23, 23, 18, 25, 23, 3, 3, 32, 32, 32, 32, 27, 34, 27, 34, 34, 34, 40, 40, 37, 3], "stanford_deprel": ["compound", "nsubj", "ROOT", "case", "nmod", "amod", "nmod:tmod", "mark", "xcomp", "det", "compound", "compound", "dobj", "punct", "appos", "punct", "punct", "xcomp", "det", "dobj", "case", "nummod", "nmod", "case", "nmod", "punct", "xcomp", "amod", "compound", "compound", "compound", "dobj", "mark", "xcomp", "dobj", "cc", "conj", "det", "compound", "dobj", "punct"] }
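The span fields in the sample can be turned into a readable (subject, relation, object) triple. The sketch below uses only the first 16 tokens of the sample above. It assumes the end indices are exclusive, which is inferred from the sample itself (tokens 10 to 12 are the ORGANIZATION mention, and `subj_end` is 13) rather than stated in the text; `mention` is a hypothetical helper.

```python
def mention(tokens, start, end):
    """Join tokens[start:end], treating `end` as exclusive (inferred from the sample)."""
    return " ".join(tokens[start:end])


# First 16 tokens and the span fields of the 'train' sample shown above.
example = {
    "token": ["Tom", "Thabane", "resigned", "in", "October", "last", "year",
              "to", "form", "the", "All", "Basotho", "Convention",
              "-LRB-", "ABC", "-RRB-"],
    "relation": "org:founded_by",
    "subj_start": 10, "subj_end": 13, "subj_type": "ORGANIZATION",
    "obj_start": 0, "obj_end": 2, "obj_type": "PERSON",
}

subject = mention(example["token"], example["subj_start"], example["subj_end"])
obj = mention(example["token"], example["obj_start"], example["obj_end"])
triple = (subject, example["relation"], obj)
print(triple)  # ('All Basotho Convention', 'org:founded_by', 'Tom Thabane')
```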
The data fields are the same across all splits.
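The `stanford_head` and `stanford_deprel` fields encode a dependency parse. Judging from the sample, head values appear to be 1-based token positions, with 0 marking the root; this reading is an assumption drawn from the sample, not from a field description. A small sketch over the first five tokens of that sample, with `dependency_edges` as a hypothetical helper:

```python
def dependency_edges(tokens, heads, deprels):
    """Return (head_token, relation, dependent_token) triples.

    Assumes heads are 1-based token positions and 0 denotes the root.
    """
    edges = []
    for dependent, head, rel in zip(tokens, heads, deprels):
        head_token = "ROOT" if head == 0 else tokens[head - 1]
        edges.append((head_token, rel, dependent))
    return edges


# First five tokens of the 'train' sample, with their heads and relations.
tokens = ["Tom", "Thabane", "resigned", "in", "October"]
heads = [2, 3, 0, 5, 3]
deprels = ["compound", "nsubj", "ROOT", "case", "nmod"]

edges = dependency_edges(tokens, heads, deprels)
for edge in edges:
    print(edge)
```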
To minimize dataset bias, TACRED is stratified across the years in which the TAC KBP challenge was run:
Version | Train | Dev | Test |
---|---|---|---|
TACRED | 68,124 (TAC KBP 2009-2012) | 22,631 (TAC KBP 2013) | 15,509 (TAC KBP 2014) |
Re-TACRED | 58,465 (TAC KBP 2009-2012) | 19,584 (TAC KBP 2013) | 13,418 (TAC KBP 2014) |
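As a quick consistency check on the table, the split sizes can be summed and expressed as shares of each version's total. The counts are copied from the table; `split_shares` is a hypothetical helper.

```python
# Split sizes copied from the table above.
SPLITS = {
    "TACRED": {"train": 68124, "dev": 22631, "test": 15509},
    "Re-TACRED": {"train": 58465, "dev": 19584, "test": 13418},
}


def split_shares(sizes):
    """Return the total size and each split's share in percent (one decimal)."""
    total = sum(sizes.values())
    shares = {name: round(100 * n / total, 1) for name, n in sizes.items()}
    return total, shares


for version, sizes in SPLITS.items():
    total, shares = split_shares(sizes)
    print(version, total, shares)
```

The TACRED splits sum to 106,264, matching the example count given in the description.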
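Initial Data Collection and Normalization
[More Information Needed]
Who are the source language producers? [More Information Needed]
See the Stanford paper and the TACRED Revisited paper, along with their appendices.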
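To ensure that models trained on TACRED are not biased towards predicting false positives on real-world text, all sampled sentences in which no relation was found between the mention pairs were fully annotated as negative examples. As a result, 79.5% of the examples are labeled no_relation.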
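The class imbalance described above can be checked on any list of relation labels. The sketch below is illustrative only: `label_share` is a hypothetical helper and `toy_labels` is made-up data, not drawn from TACRED. The final estimate multiplies the rounded 79.5% figure by the 106,264 examples, so it is only approximate.

```python
from collections import Counter


def label_share(labels, label="no_relation"):
    """Fraction of `labels` equal to `label`; 0.0 for an empty input."""
    counts = Counter(labels)
    total = sum(counts.values())
    return counts[label] / total if total else 0.0


# Made-up labels for illustration only, not real TACRED data.
toy_labels = ["no_relation"] * 4 + ["org:founded_by"]
print(label_share(toy_labels))

# Rough estimate of negatives implied by the rounded 79.5% figure.
approx_negatives = round(0.795 * 106264)
print(approx_negatives)
```

With a share this large, accuracy alone says little about a model, which is one reason to inspect the label distribution before training.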
Who are the annotators? [More Information Needed]
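To respect the copyright of the underlying TAC KBP corpus, TACRED is released via the Linguistic Data Consortium (LDC License). You can download TACRED from the LDC TACRED webpage. Access is free for LDC members; otherwise, an access fee of $25 applies.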
For the original dataset, please cite:
@inproceedings{zhang2017tacred, author = {Zhang, Yuhao and Zhong, Victor and Chen, Danqi and Angeli, Gabor and Manning, Christopher D.}, booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017)}, title = {Position-aware Attention and Supervised Data Improve Slot Filling}, url = {https://nlp.stanford.edu/pubs/zhang2017tacred.pdf}, pages = {35--45}, year = {2017} }
For the revised version ("revisited"), please also cite:
@inproceedings{alt-etal-2020-tacred, title = "{TACRED} Revisited: A Thorough Evaluation of the {TACRED} Relation Extraction Task", author = "Alt, Christoph and Gabryszak, Aleksandra and Hennig, Leonhard", booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics", month = jul, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.acl-main.142", doi = "10.18653/v1/2020.acl-main.142", pages = "1558--1569", }
For the relabeled version ("re-tacred"), please also cite:
@inproceedings{DBLP:conf/aaai/StoicaPP21, author = {George Stoica and Emmanouil Antonios Platanios and Barnab{\'{a}}s P{\'{o}}czos}, title = {Re-TACRED: Addressing Shortcomings of the {TACRED} Dataset}, booktitle = {Thirty-Fifth {AAAI} Conference on Artificial Intelligence, {AAAI} 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, {IAAI} 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, {EAAI} 2021, Virtual Event, February 2-9, 2021}, pages = {13843--13850}, publisher = {{AAAI} Press}, year = {2021}, url = {https://ojs.aaai.org/index.php/AAAI/article/view/17631}, }