Dataset:

DFKI-SLT/cross_ner

English

Dataset Card for CrossNER

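Dataset Summary

CrossNER is a fully labeled named entity recognition (NER) dataset spanning five diverse domains (politics, natural science, music, literature, and artificial intelligence), each with its own specialized entity categories. In addition, CrossNER includes unlabeled domain-related corpora for the same five domains.

For details, see the paper: CrossNER: Evaluating Cross-Domain Named Entity Recognition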

Supported Tasks and Leaderboards

More Information Needed

Languages

The language data in CrossNER is in English (BCP-47 en).

Dataset Structure

Data Instances

conll2003
  • Size of downloaded dataset files: 2.69 MB
  • Size of the generated dataset: 5.26 MB

An example of 'train' looks as follows:

{
  "id": "0", 
  "tokens": ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."], 
  "ner_tags": [49, 0, 41, 0, 0, 0, 41, 0, 0]
}

politics
  • Size of downloaded dataset files: 0.72 MB
  • Size of the generated dataset: 1.04 MB

An example of 'train' looks as follows:

{
  "id": "0", 
  "tokens": ["Parties", "with", "mainly", "Eurosceptic", "views", "are", "the", "ruling", "United", "Russia", ",", "and", "opposition", "parties", "the", "Communist", "Party", "of", "the", "Russian", "Federation", "and", "Liberal", "Democratic", "Party", "of", "Russia", "."], 
  "ner_tags": [0, 0, 0, 0, 0, 0, 0, 0, 55, 56, 0, 0, 0, 0, 0, 55, 56, 56, 56, 56, 56, 0, 55, 56, 56, 56, 56, 0]
}

science
  • Size of downloaded dataset files: 0.49 MB
  • Size of the generated dataset: 0.73 MB

An example of 'train' looks as follows:

{
  "id": "0", 
  "tokens": ["They", "may", "also", "use", "Adenosine", "triphosphate", ",", "Nitric", "oxide", ",", "and", "ROS", "for", "signaling", "in", "the", "same", "ways", "that", "animals", "do", "."], 
  "ner_tags": [0, 0, 0, 0, 15, 16, 0, 15, 16, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}

music
  • Size of downloaded dataset files: 0.41 MB
  • Size of the generated dataset: 0.65 MB

An example of 'train' looks as follows:

{
  "id": "0", 
  "tokens": ["In", "2003", ",", "the", "Stade", "de", "France", "was", "the", "primary", "site", "of", "the", "2003", "World", "Championships", "in", "Athletics", "."], 
  "ner_tags": [0, 0, 0, 0, 35, 36, 36, 0, 0, 0, 0, 0, 0, 29, 30, 30, 30, 30, 0]
}

literature
  • Size of downloaded dataset files: 0.33 MB
  • Size of the generated dataset: 0.58 MB

An example of 'train' looks as follows:

{
  "id": "0",
  "tokens": ["In", "1351", ",", "during", "the", "reign", "of", "Emperor", "Toghon", "Temür", "of", "the", "Yuan", "dynasty", ",", "93rd-generation", "descendant", "Kong", "Huan", "(", "孔浣", ")", "'", "s", "2nd", "son", "Kong", "Shao", "(", "孔昭", ")", "moved", "from", "China", "to", "Korea", "during", "the", "Goryeo", ",", "and", "was", "received", "courteously", "by", "Princess", "Noguk", "(", "the", "Mongolian-born", "wife", "of", "the", "future", "king", "Gongmin", ")", "."], 
  "ner_tags": [0, 0, 0, 0, 0, 0, 0, 51, 52, 52, 0, 0, 21, 22, 0, 0, 0, 77, 78, 0, 77, 0, 0, 0, 0, 0, 77, 78, 0, 77, 0, 0, 0, 21, 0, 21, 0, 0, 41, 0, 0, 0, 0, 0, 0, 51, 52, 0, 0, 41, 0, 0, 0, 0, 0, 51, 0, 0]
}

ai
  • Size of downloaded dataset files: 0.29 MB
  • Size of the generated dataset: 0.48 MB

An example of 'train' looks as follows:

{
  "id": "0", 
  "tokens": ["Popular", "approaches", "of", "opinion-based", "recommender", "system", "utilize", "various", "techniques", "including", "text", "mining", ",", "information", "retrieval", ",", "sentiment", "analysis", "(", "see", "also", "Multimodal", "sentiment", "analysis", ")", "and", "deep", "learning", "X.Y.", "Feng", ",", "H.", "Zhang", ",", "Y.J.", "Ren", ",", "P.H.", "Shang", ",", "Y.", "Zhu", ",", "Y.C.", "Liang", ",", "R.C.", "Guan", ",", "D.", "Xu", ",", "(", "2019", ")", ",", ",", "21", "(", "5", ")", ":", "e12957", "."], 
  "ner_tags": [0, 0, 0, 59, 60, 60, 0, 0, 0, 0, 31, 32, 0, 71, 72, 0, 71, 72, 0, 0, 0, 71, 72, 72, 0, 0, 31, 32, 65, 66, 0, 65, 66, 0, 65, 66, 0, 65, 66, 0, 65, 66, 0, 65, 66, 0, 65, 66, 0, 65, 66, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}
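
The instances above can be obtained with the Hugging Face `datasets` library. The following is a minimal sketch, assuming `datasets` is installed and that the repository ID and configuration names match those listed on this card. The helper names `load_cross_ner` and `check_alignment` are hypothetical and not part of the dataset, and depending on the installed `datasets` version, loading may require additional arguments that this sketch does not cover.

```python
from typing import Any, Dict

# Configuration names as listed on this card.
CONFIGS = ["conll2003", "politics", "science", "music", "literature", "ai"]


def load_cross_ner(config: str = "politics"):
    """Load one CrossNER configuration from the Hugging Face Hub (needs network access)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    # Imported lazily so the rest of this sketch runs without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("DFKI-SLT/cross_ner", config)


def check_alignment(example: Dict[str, Any]) -> bool:
    """Return True if the instance has exactly one NER tag per token."""
    return len(example["tokens"]) == len(example["ner_tags"])


# The science instance shown above, copied verbatim.
science_example = {
    "id": "0",
    "tokens": ["They", "may", "also", "use", "Adenosine", "triphosphate", ",",
               "Nitric", "oxide", ",", "and", "ROS", "for", "signaling", "in",
               "the", "same", "ways", "that", "animals", "do", "."],
    "ner_tags": [0, 0, 0, 0, 15, 16, 0, 15, 16, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}

# Offline sanity check: tokens and tags must line up one-to-one.
assert check_alignment(science_example)
```

Calling `load_cross_ner("science")` would download the data, so the sketch only exercises the offline alignment check.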

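Data Fields

The data fields are the same among all splits.

  • id: the instance ID of the sentence, a string feature.
  • tokens: the list of tokens in the sentence, a list of string features.
  • ner_tags: the list of entity tags, a list of classification labels, mapped to integer IDs as follows: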
{"O": 0, "B-academicjournal": 1, "I-academicjournal": 2, "B-album": 3, "I-album": 4, "B-algorithm": 5, "I-algorithm": 6, "B-astronomicalobject": 7, "I-astronomicalobject": 8, "B-award": 9, "I-award": 10, "B-band": 11, "I-band": 12, "B-book": 13, "I-book": 14, "B-chemicalcompound": 15, "I-chemicalcompound": 16, "B-chemicalelement": 17, "I-chemicalelement": 18, "B-conference": 19, "I-conference": 20, "B-country": 21, "I-country": 22, "B-discipline": 23, "I-discipline": 24, "B-election": 25, "I-election": 26, "B-enzyme": 27, "I-enzyme": 28, "B-event": 29, "I-event": 30, "B-field": 31, "I-field": 32, "B-literarygenre": 33, "I-literarygenre": 34, "B-location": 35, "I-location": 36, "B-magazine": 37, "I-magazine": 38, "B-metrics": 39, "I-metrics": 40, "B-misc": 41, "I-misc": 42, "B-musicalartist": 43, "I-musicalartist": 44, "B-musicalinstrument": 45, "I-musicalinstrument": 46, "B-musicgenre": 47, "I-musicgenre": 48, "B-organisation": 49, "I-organisation": 50, "B-person": 51, "I-person": 52, "B-poem": 53, "I-poem": 54, "B-politicalparty": 55, "I-politicalparty": 56, "B-politician": 57, "I-politician": 58, "B-product": 59, "I-product": 60, "B-programlang": 61, "I-programlang": 62, "B-protein": 63, "I-protein": 64, "B-researcher": 65, "I-researcher": 66, "B-scientist": 67, "I-scientist": 68, "B-song": 69, "I-song": 70, "B-task": 71, "I-task": 72, "B-theory": 73, "I-theory": 74, "B-university": 75, "I-university": 76, "B-writer": 77, "I-writer": 78}
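
To read the integer `ner_tags`, the label map above can be inverted and the BIO tags grouped into entity spans. The sketch below rebuilds the ID-to-label list from the alphabetical order of entity types seen in the map, and the function names `decode_tags` and `extract_entities` are hypothetical helpers, not part of the dataset. As a simplifying assumption, an `I-` tag that does not continue an open entity of the same type is treated as the start of a new entity.

```python
from typing import List, Tuple

# Entity types in the alphabetical order used by the label map above.
ENTITY_TYPES = [
    "academicjournal", "album", "algorithm", "astronomicalobject", "award",
    "band", "book", "chemicalcompound", "chemicalelement", "conference",
    "country", "discipline", "election", "enzyme", "event", "field",
    "literarygenre", "location", "magazine", "metrics", "misc",
    "musicalartist", "musicalinstrument", "musicgenre", "organisation",
    "person", "poem", "politicalparty", "politician", "product",
    "programlang", "protein", "researcher", "scientist", "song", "task",
    "theory", "university", "writer",
]

# "O" is ID 0, followed by a B-/I- pair for each entity type.
ID2LABEL = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]


def decode_tags(ner_tags: List[int]) -> List[str]:
    """Map integer tag IDs to their BIO label strings."""
    return [ID2LABEL[i] for i in ner_tags]


def extract_entities(tokens: List[str], ner_tags: List[int]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None
    for token, label in zip(tokens, decode_tags(ner_tags)):
        if label.startswith("I-") and label[2:] == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
            continue
        # Any other label closes the open entity, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        if label == "O":
            current_tokens, current_type = [], None
        else:
            # "B-" tag, or a stray "I-" tag treated as a new entity (assumption).
            current_tokens, current_type = [token], label[2:]
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# The conll2003 instance from this card.
conll_tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
conll_tags = [49, 0, 41, 0, 0, 0, 41, 0, 0]
print(decode_tags(conll_tags))
print(extract_entities(conll_tokens, conll_tags))
```

Joining tokens with a single space is a display convenience only and does not recover the original spacing of the sentence.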

Data Splits

Subset        Train     Dev    Test
conll2003    14,987   3,466   3,684
politics        200     541     651
science         200     450     543
music           100     380     456
literature      100     400     416
ai              100     350     431
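
The split sizes show that the five domain subsets have far fewer training sentences than development or test sentences, unlike conll2003. The short sketch below makes that concrete by computing each subset's total size and the share taken by its training split. The numbers are copied from the table, and the names `SPLIT_SIZES` and `train_share` are illustrative only.

```python
from typing import Dict

# Sentence counts copied from the split table on this card.
SPLIT_SIZES: Dict[str, Dict[str, int]] = {
    "conll2003": {"train": 14987, "dev": 3466, "test": 3684},
    "politics": {"train": 200, "dev": 541, "test": 651},
    "science": {"train": 200, "dev": 450, "test": 543},
    "music": {"train": 100, "dev": 380, "test": 456},
    "literature": {"train": 100, "dev": 400, "test": 416},
    "ai": {"train": 100, "dev": 350, "test": 431},
}


def train_share(sizes: Dict[str, int]) -> float:
    """Fraction of a subset's sentences that fall in the training split."""
    return sizes["train"] / sum(sizes.values())


for name, sizes in SPLIT_SIZES.items():
    total = sum(sizes.values())
    print(f"{name:<10} total={total:>6} train share={train_share(sizes):.1%}")
```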

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information

@article{liu2020crossner,
      title={CrossNER: Evaluating Cross-Domain Named Entity Recognition}, 
      author={Zihan Liu and Yan Xu and Tiezheng Yu and Wenliang Dai and Ziwei Ji and Samuel Cahyawijaya and Andrea Madotto and Pascale Fung},
      year={2020},
      eprint={2012.04373},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Thanks to @phucdev for adding this dataset.