数据集:
DFKI-SLT/cross_ner
任务:
语言:
计算机处理:
monolingual大小:
10K<n<100K语言创建人:
found批注创建人:
expert-generated源数据集:
extended|conll2003预印本库:
arxiv:2012.04373CrossNER 是一个完全标注的命名实体识别(NER)数据集,涵盖了五个不同领域(政治、自然科学、音乐、文学和人工智能),每个领域都有特定的实体类别。此外,CrossNER 还包括对应五个领域的无标注相关语料库。
详情请参考论文: CrossNER: Evaluating Cross-Domain Named Entity Recognition
CrossNER 中的语言数据为英语(BCP-47 en)
'train' 的示例如下所示:
{ "id": "0", "tokens": ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."], "ner_tags": [49, 0, 41, 0, 0, 0, 41, 0, 0] }politics
'train' 的示例如下所示:
{ "id": "0", "tokens": ["Parties", "with", "mainly", "Eurosceptic", "views", "are", "the", "ruling", "United", "Russia", ",", "and", "opposition", "parties", "the", "Communist", "Party", "of", "the", "Russian", "Federation", "and", "Liberal", "Democratic", "Party", "of", "Russia", "."], "ner_tags": [0, 0, 0, 0, 0, 0, 0, 0, 55, 56, 0, 0, 0, 0, 0, 55, 56, 56, 56, 56, 56, 0, 55, 56, 56, 56, 56, 0] }science
'train' 的示例如下所示:
{ "id": "0", "tokens": ["They", "may", "also", "use", "Adenosine", "triphosphate", ",", "Nitric", "oxide", ",", "and", "ROS", "for", "signaling", "in", "the", "same", "ways", "that", "animals", "do", "."], "ner_tags": [0, 0, 0, 0, 15, 16, 0, 15, 16, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] }music
'train' 的示例如下所示:
{ "id": "0", "tokens": ["In", "2003", ",", "the", "Stade", "de", "France", "was", "the", "primary", "site", "of", "the", "2003", "World", "Championships", "in", "Athletics", "."], "ner_tags": [0, 0, 0, 0, 35, 36, 36, 0, 0, 0, 0, 0, 0, 29, 30, 30, 30, 30, 0] }literature
'train' 的示例如下所示:
{ "id": "0", "tokens": ["In", "1351", ",", "during", "the", "reign", "of", "Emperor", "Toghon", "Temür", "of", "the", "Yuan", "dynasty", ",", "93rd-generation", "descendant", "Kong", "Huan", "(", "孔浣", ")", "'", "s", "2nd", "son", "Kong", "Shao", "(", "孔昭", ")", "moved", "from", "China", "to", "Korea", "during", "the", "Goryeo", ",", "and", "was", "received", "courteously", "by", "Princess", "Noguk", "(", "the", "Mongolian-born", "wife", "of", "the", "future", "king", "Gongmin", ")", "."], "ner_tags": [0, 0, 0, 0, 0, 0, 0, 51, 52, 52, 0, 0, 21, 22, 0, 0, 0, 77, 78, 0, 77, 0, 0, 0, 0, 0, 77, 78, 0, 77, 0, 0, 0, 21, 0, 21, 0, 0, 41, 0, 0, 0, 0, 0, 0, 51, 52, 0, 0, 41, 0, 0, 0, 0, 0, 51, 0, 0] }ai
'train' 的示例如下所示:
{ "id": "0", "tokens": ["Popular", "approaches", "of", "opinion-based", "recommender", "system", "utilize", "various", "techniques", "including", "text", "mining", ",", "information", "retrieval", ",", "sentiment", "analysis", "(", "see", "also", "Multimodal", "sentiment", "analysis", ")", "and", "deep", "learning", "X.Y.", "Feng", ",", "H.", "Zhang", ",", "Y.J.", "Ren", ",", "P.H.", "Shang", ",", "Y.", "Zhu", ",", "Y.C.", "Liang", ",", "R.C.", "Guan", ",", "D.", "Xu", ",", "(", "2019", ")", ",", ",", "21", "(", "5", ")", ":", "e12957", "."], "ner_tags": [0, 0, 0, 59, 60, 60, 0, 0, 0, 0, 31, 32, 0, 71, 72, 0, 71, 72, 0, 0, 0, 71, 72, 72, 0, 0, 31, 32, 65, 66, 0, 65, 66, 0, 65, 66, 0, 65, 66, 0, 65, 66, 0, 65, 66, 0, 65, 66, 0, 65, 66, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] }
所有拆分的数据字段相同。
{"O": 0, "B-academicjournal": 1, "I-academicjournal": 2, "B-album": 3, "I-album": 4, "B-algorithm": 5, "I-algorithm": 6, "B-astronomicalobject": 7, "I-astronomicalobject": 8, "B-award": 9, "I-award": 10, "B-band": 11, "I-band": 12, "B-book": 13, "I-book": 14, "B-chemicalcompound": 15, "I-chemicalcompound": 16, "B-chemicalelement": 17, "I-chemicalelement": 18, "B-conference": 19, "I-conference": 20, "B-country": 21, "I-country": 22, "B-discipline": 23, "I-discipline": 24, "B-election": 25, "I-election": 26, "B-enzyme": 27, "I-enzyme": 28, "B-event": 29, "I-event": 30, "B-field": 31, "I-field": 32, "B-literarygenre": 33, "I-literarygenre": 34, "B-location": 35, "I-location": 36, "B-magazine": 37, "I-magazine": 38, "B-metrics": 39, "I-metrics": 40, "B-misc": 41, "I-misc": 42, "B-musicalartist": 43, "I-musicalartist": 44, "B-musicalinstrument": 45, "I-musicalinstrument": 46, "B-musicgenre": 47, "I-musicgenre": 48, "B-organisation": 49, "I-organisation": 50, "B-person": 51, "I-person": 52, "B-poem": 53, "I-poem": 54, "B-politicalparty": 55, "I-politicalparty": 56, "B-politician": 57, "I-politician": 58, "B-product": 59, "I-product": 60, "B-programlang": 61, "I-programlang": 62, "B-protein": 63, "I-protein": 64, "B-researcher": 65, "I-researcher": 66, "B-scientist": 67, "I-scientist": 68, "B-song": 69, "I-song": 70, "B-task": 71, "I-task": 72, "B-theory": 73, "I-theory": 74, "B-university": 75, "I-university": 76, "B-writer": 77, "I-writer": 78}
Train | Dev | Test | |
---|---|---|---|
conll2003 | 14,987 | 3,466 | 3,684 |
politics | 200 | 541 | 651 |
science | 200 | 450 | 543 |
music | 100 | 380 | 456 |
literature | 100 | 400 | 416 |
ai | 100 | 350 | 431 |
@article{liu2020crossner, title={CrossNER: Evaluating Cross-Domain Named Entity Recognition}, author={Zihan Liu and Yan Xu and Tiezheng Yu and Wenliang Dai and Ziwei Ji and Samuel Cahyawijaya and Andrea Madotto and Pascale Fung}, year={2020}, eprint={2012.04373}, archivePrefix={arXiv}, primaryClass={cs.CL} }
感谢 @phucdev 添加此数据集。