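Dataset:
conll2012_ontonotesv5
Tasks:
token-classification
Multilinguality:
multilingual
Size:
10K<n<100K
Language creators:
found
Annotation creators:
expert-generated
Source datasets:
original
License:
cc-by-nc-nd-4.0

OntoNotes v5.0 is the final version of the OntoNotes corpus: a large-scale, multi-genre, multilingual corpus manually annotated with syntactic, semantic, and discourse information.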
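This dataset is the extended version of OntoNotes v5.0 used in the CoNLL-2012 shared task. It includes the v4 train/dev and v9 test data for English, Chinese, and Arabic, plus the corrected v12 train/dev/test data (English only).
The data come from the Mendeley Data repo ontonotes-conll2012. They appear to be identical to the official data, but anyone using this dataset does so at their own risk.
See also the Papers with Code summaries of OntoNotes 5.0 and CoNLL-2012.
For more detail on the dataset, such as the annotation scheme and tag sets, refer to the documentation in the Mendeley repo mentioned above.
V4 data for Arabic, Chinese, and English, and V12 data for English.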
{'document_id': 'nw/wsj/23/wsj_2311',
 'sentences': [{'part_id': 0,
                'words': ['CONCORDE', 'trans-Atlantic', 'flights', 'are', '$', '2,400', 'to', 'Paris', 'and', '$', '3,200', 'to', 'London', '.'],
                'pos_tags': [25, 18, 27, 43, 2, 12, 17, 25, 11, 2, 12, 17, 25, 7],
                'parse_tree': '(TOP(S(NP (NNP CONCORDE) (JJ trans-Atlantic) (NNS flights) )(VP (VBP are) (NP(NP(NP ($ $) (CD 2,400) )(PP (IN to) (NP (NNP Paris) ))) (CC and) (NP(NP ($ $) (CD 3,200) )(PP (IN to) (NP (NNP London) ))))) (. .) ))',
                'predicate_lemmas': [None, None, None, 'be', None, None, None, None, None, None, None, None, None, None],
                'predicate_framenet_ids': [None, None, None, '01', None, None, None, None, None, None, None, None, None, None],
                'word_senses': [None, None, None, None, None, None, None, None, None, None, None, None, None, None],
                'speaker': None,
                'named_entities': [7, 6, 0, 0, 0, 15, 0, 5, 0, 0, 15, 0, 5, 0],
                'srl_frames': [{'frames': ['B-ARG1', 'I-ARG1', 'I-ARG1', 'B-V', 'B-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'I-ARG2', 'O'],
                                'verb': 'are'}],
                'coref_spans': []},
               {'part_id': 0,
                'words': ['In', 'a', 'Centennial', 'Journal', 'article', 'Oct.', '5', ',', 'the', 'fares', 'were', 'reversed', '.'],
                'pos_tags': [17, 13, 25, 25, 24, 25, 12, 4, 13, 27, 40, 42, 7],
                'parse_tree': '(TOP(S(PP (IN In) (NP (DT a) (NML (NNP Centennial) (NNP Journal) ) (NN article) ))(NP (NNP Oct.) (CD 5) ) (, ,) (NP (DT the) (NNS fares) )(VP (VBD were) (VP (VBN reversed) )) (. .) ))',
                'predicate_lemmas': [None, None, None, None, None, None, None, None, None, None, None, 'reverse', None],
                'predicate_framenet_ids': [None, None, None, None, None, None, None, None, None, None, None, '01', None],
                'word_senses': [None, None, None, None, None, None, None, None, None, None, None, None, None],
                'speaker': None,
                'named_entities': [0, 0, 4, 22, 0, 12, 30, 0, 0, 0, 0, 0, 0],
                'srl_frames': [{'frames': ['B-ARGM-LOC', 'I-ARGM-LOC', 'I-ARGM-LOC', 'I-ARGM-LOC', 'I-ARGM-LOC', 'B-ARGM-TMP', 'I-ARGM-TMP', 'O', 'B-ARG1', 'I-ARG1', 'O', 'B-V', 'O'],
                                'verb': 'reversed'}],
                'coref_spans': []}]}
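The 'frames' lists in the example instance are BIO tag sequences, one tag per word. As a minimal sketch of how such a sequence might be turned into labeled argument spans, the helper below (bio_to_spans is a hypothetical name, not part of the dataset or any library) groups consecutive tags into (label, start, end) triples with an exclusive end index. The words and frames are copied from the second sentence of the example.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO tags into (label, start, end) spans; end is exclusive.

    Hypothetical helper for illustration; not provided by the dataset.
    """
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # An I- tag carrying the currently open label continues that span.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the open span, if there is one.
        if label is not None:
            spans.append((label, start, i))
            label, start = None, None
        # B- opens a new span; a stray I- is also treated as an opening tag.
        if tag != "O":
            label, start = tag[2:], i
    # Close a span that runs to the end of the sentence.
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Second sentence of the example instance.
words = ["In", "a", "Centennial", "Journal", "article", "Oct.", "5", ",",
         "the", "fares", "were", "reversed", "."]
frames = ["B-ARGM-LOC", "I-ARGM-LOC", "I-ARGM-LOC", "I-ARGM-LOC", "I-ARGM-LOC",
          "B-ARGM-TMP", "I-ARGM-TMP", "O", "B-ARG1", "I-ARG1", "O", "B-V", "O"]

spans = bio_to_spans(frames)
for span_label, span_start, span_end in spans:
    print(span_label, "->", " ".join(words[span_start:span_end]))
```

Exclusive end indices are used so that each span can be applied directly as a Python slice over 'words'.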
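Each element of sentences is a dictionary with the following data fields, as seen in the example instance: part_id, words, pos_tags, parse_tree, predicate_lemmas, predicate_framenet_ids, word_senses, speaker, named_entities, srl_frames, and coref_spans.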
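Because the token-level fields of a sentence are parallel lists, a document can be flattened into one row per token for token-classification work. The following is a rough sketch under that assumption. iter_token_rows is a hypothetical helper, and the integer tag ids are kept as-is because the id-to-label mapping is not given here. The sample document reuses the second sentence of the example instance, trimmed to the fields the helper reads.

```python
from typing import Any, Dict, Iterator


def iter_token_rows(document: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per token, pairing each word with its POS and NER ids.

    Hypothetical helper for illustration; not provided by the dataset.
    """
    for sent_idx, sentence in enumerate(document["sentences"]):
        words = sentence["words"]
        pos_tags = sentence["pos_tags"]
        entities = sentence["named_entities"]
        # The per-token fields are expected to be aligned with the words.
        if not (len(words) == len(pos_tags) == len(entities)):
            raise ValueError(f"misaligned fields in sentence {sent_idx}")
        for tok_idx, (word, pos, ner) in enumerate(zip(words, pos_tags, entities)):
            yield {
                "document_id": document["document_id"],
                "part_id": sentence["part_id"],
                "sentence": sent_idx,
                "token": tok_idx,
                "word": word,
                "pos_tag": pos,        # integer class id, left undecoded
                "named_entity": ner,   # integer class id, left undecoded
            }


# Trimmed copy of the example instance (second sentence only).
document = {
    "document_id": "nw/wsj/23/wsj_2311",
    "sentences": [{
        "part_id": 0,
        "words": ["In", "a", "Centennial", "Journal", "article", "Oct.", "5",
                  ",", "the", "fares", "were", "reversed", "."],
        "pos_tags": [17, 13, 25, 25, 24, 25, 12, 4, 13, 27, 40, 42, 7],
        "named_entities": [0, 0, 4, 22, 0, 12, 30, 0, 0, 0, 0, 0, 0],
    }],
}

rows = list(iter_token_rows(document))
for row in rows[:4]:
    print(row["word"], row["pos_tag"], row["named_entity"])
```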
Each dataset configuration (arabic_v4, chinese_v4, english_v4, english_v12) has 3 splits: train, validation, and test.
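One way to select a configuration and split is through the Hugging Face datasets library. The sketch below assumes that library is installed, that the dataset id is the one shown in the card header, and that the split identifiers are spelled "train", "validation", and "test". load_split is a hypothetical wrapper, and depending on the installed datasets version, script-based datasets may need extra arguments or may not load at all.

```python
CONFIGS = ("arabic_v4", "chinese_v4", "english_v4", "english_v12")
SPLITS = ("train", "validation", "test")  # assumed spelling of the split names


def load_split(config: str, split: str):
    """Load one split of one configuration (hypothetical wrapper).

    Names are validated first so that a typo fails before any download starts.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the validation above works without the library.
    from datasets import load_dataset
    return load_dataset("conll2012_ontonotesv5", config, split=split)


if __name__ == "__main__":
    # Requires network access and the `datasets` package.
    validation_set = load_split("english_v4", "validation")
    print(validation_set[0]["document_id"])
```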
@inproceedings{pradhan-etal-2013-towards,
    title = "Towards Robust Linguistic Analysis using {O}nto{N}otes",
    author = {Pradhan, Sameer and
      Moschitti, Alessandro and
      Xue, Nianwen and
      Ng, Hwee Tou and
      Bj{\"o}rkelund, Anders and
      Uryupina, Olga and
      Zhang, Yuchen and
      Zhong, Zhi},
    booktitle = "Proceedings of the Seventeenth Conference on Computational Natural Language Learning",
    month = aug,
    year = "2013",
    address = "Sofia, Bulgaria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W13-3516",
    pages = "143--152",
}
Thanks to @richarddwang for adding this dataset.