Dataset:

DFKI-SLT/cross_ner

Tasks:

Token Classification

Sub-tasks:

named-entity-recognition

Languages:

English

Multilinguality:

monolingual

Size:

10K<n<100K

Language creators:

found

Annotation creators:

expert-generated

Source datasets:

extended|conll2003

ArXiv:

arxiv:2012.04373

Other:

cross domain ai news cross+domain

Dataset Card for CrossNER

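Dataset Summary

CrossNER is a fully labeled named entity recognition (NER) dataset spanning five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence), each with its own domain-specific entity categories. CrossNER also includes unlabeled domain-related corpora for the same five domains.

For details, see the paper: CrossNER: Evaluating Cross-Domain Named Entity Recognition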

Supported Tasks and Leaderboards

More Information Needed

Languages

The language data in CrossNER is in English (BCP-47 en).

Dataset Structure

Data Instances

conll2003

Size of downloaded dataset files: 2.69 MB
Size of the generated dataset: 5.26 MB

An example from 'train' looks as follows:

{
  "id": "0", 
  "tokens": ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."], 
  "ner_tags": [49, 0, 41, 0, 0, 0, 41, 0, 0]
}

politics

Size of downloaded dataset files: 0.72 MB
Size of the generated dataset: 1.04 MB

An example from 'train' looks as follows:

{
  "id": "0", 
  "tokens": ["Parties", "with", "mainly", "Eurosceptic", "views", "are", "the", "ruling", "United", "Russia", ",", "and", "opposition", "parties", "the", "Communist", "Party", "of", "the", "Russian", "Federation", "and", "Liberal", "Democratic", "Party", "of", "Russia", "."], 
  "ner_tags": [0, 0, 0, 0, 0, 0, 0, 0, 55, 56, 0, 0, 0, 0, 0, 55, 56, 56, 56, 56, 56, 0, 55, 56, 56, 56, 56, 0]
}

science

Size of downloaded dataset files: 0.49 MB
Size of the generated dataset: 0.73 MB

An example from 'train' looks as follows:

{
  "id": "0", 
  "tokens": ["They", "may", "also", "use", "Adenosine", "triphosphate", ",", "Nitric", "oxide", ",", "and", "ROS", "for", "signaling", "in", "the", "same", "ways", "that", "animals", "do", "."], 
  "ner_tags": [0, 0, 0, 0, 15, 16, 0, 15, 16, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}

music

Size of downloaded dataset files: 0.41 MB
Size of the generated dataset: 0.65 MB

An example from 'train' looks as follows:

{
  "id": "0", 
  "tokens": ["In", "2003", ",", "the", "Stade", "de", "France", "was", "the", "primary", "site", "of", "the", "2003", "World", "Championships", "in", "Athletics", "."], 
  "ner_tags": [0, 0, 0, 0, 35, 36, 36, 0, 0, 0, 0, 0, 0, 29, 30, 30, 30, 30, 0]
}

literature

Size of downloaded dataset files: 0.33 MB
Size of the generated dataset: 0.58 MB

An example from 'train' looks as follows:

{
  "id": "0",
  "tokens": ["In", "1351", ",", "during", "the", "reign", "of", "Emperor", "Toghon", "Temür", "of", "the", "Yuan", "dynasty", ",", "93rd-generation", "descendant", "Kong", "Huan", "(", "孔浣", ")", "'", "s", "2nd", "son", "Kong", "Shao", "(", "孔昭", ")", "moved", "from", "China", "to", "Korea", "during", "the", "Goryeo", ",", "and", "was", "received", "courteously", "by", "Princess", "Noguk", "(", "the", "Mongolian-born", "wife", "of", "the", "future", "king", "Gongmin", ")", "."], 
  "ner_tags": [0, 0, 0, 0, 0, 0, 0, 51, 52, 52, 0, 0, 21, 22, 0, 0, 0, 77, 78, 0, 77, 0, 0, 0, 0, 0, 77, 78, 0, 77, 0, 0, 0, 21, 0, 21, 0, 0, 41, 0, 0, 0, 0, 0, 0, 51, 52, 0, 0, 41, 0, 0, 0, 0, 0, 51, 0, 0]
}

ai

Size of downloaded dataset files: 0.29 MB
Size of the generated dataset: 0.48 MB

An example from 'train' looks as follows:

{
  "id": "0", 
  "tokens": ["Popular", "approaches", "of", "opinion-based", "recommender", "system", "utilize", "various", "techniques", "including", "text", "mining", ",", "information", "retrieval", ",", "sentiment", "analysis", "(", "see", "also", "Multimodal", "sentiment", "analysis", ")", "and", "deep", "learning", "X.Y.", "Feng", ",", "H.", "Zhang", ",", "Y.J.", "Ren", ",", "P.H.", "Shang", ",", "Y.", "Zhu", ",", "Y.C.", "Liang", ",", "R.C.", "Guan", ",", "D.", "Xu", ",", "(", "2019", ")", ",", ",", "21", "(", "5", ")", ":", "e12957", "."], 
  "ner_tags": [0, 0, 0, 59, 60, 60, 0, 0, 0, 0, 31, 32, 0, 71, 72, 0, 71, 72, 0, 0, 0, 71, 72, 72, 0, 0, 31, 32, 65, 66, 0, 65, 66, 0, 65, 66, 0, 65, 66, 0, 65, 66, 0, 65, 66, 0, 65, 66, 0, 65, 66, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}

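Data Fields

The data fields are the same across all splits.

id: the instance ID of the sentence, a string feature.
tokens: the list of tokens in the sentence, a list of string features.
ner_tags: the list of entity tags, a list of classification labels. Each integer maps to a label name as follows: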

{"O": 0, "B-academicjournal": 1, "I-academicjournal": 2, "B-album": 3, "I-album": 4, "B-algorithm": 5, "I-algorithm": 6, "B-astronomicalobject": 7, "I-astronomicalobject": 8, "B-award": 9, "I-award": 10, "B-band": 11, "I-band": 12, "B-book": 13, "I-book": 14, "B-chemicalcompound": 15, "I-chemicalcompound": 16, "B-chemicalelement": 17, "I-chemicalelement": 18, "B-conference": 19, "I-conference": 20, "B-country": 21, "I-country": 22, "B-discipline": 23, "I-discipline": 24, "B-election": 25, "I-election": 26, "B-enzyme": 27, "I-enzyme": 28, "B-event": 29, "I-event": 30, "B-field": 31, "I-field": 32, "B-literarygenre": 33, "I-literarygenre": 34, "B-location": 35, "I-location": 36, "B-magazine": 37, "I-magazine": 38, "B-metrics": 39, "I-metrics": 40, "B-misc": 41, "I-misc": 42, "B-musicalartist": 43, "I-musicalartist": 44, "B-musicalinstrument": 45, "I-musicalinstrument": 46, "B-musicgenre": 47, "I-musicgenre": 48, "B-organisation": 49, "I-organisation": 50, "B-person": 51, "I-person": 52, "B-poem": 53, "I-poem": 54, "B-politicalparty": 55, "I-politicalparty": 56, "B-politician": 57, "I-politician": 58, "B-product": 59, "I-product": 60, "B-programlang": 61, "I-programlang": 62, "B-protein": 63, "I-protein": 64, "B-researcher": 65, "I-researcher": 66, "B-scientist": 67, "I-scientist": 68, "B-song": 69, "I-song": 70, "B-task": 71, "I-task": 72, "B-theory": 73, "I-theory": 74, "B-university": 75, "I-university": 76, "B-writer": 77, "I-writer": 78}
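
The label names above follow a BIO scheme: 0 is O, and every entity type contributes one B- (begin) id and one I- (inside) id, in the order the types are listed. The following is a minimal sketch, assuming that regularity, of how ner_tags could be decoded into label names and grouped into entity spans. The names ENTITY_TYPES, ID2LABEL, decode_tags and extract_spans are illustrative and are not part of the dataset or of any library.

```python
# Entity types in the order used by the label mapping in this card.
ENTITY_TYPES = [
    "academicjournal", "album", "algorithm", "astronomicalobject", "award",
    "band", "book", "chemicalcompound", "chemicalelement", "conference",
    "country", "discipline", "election", "enzyme", "event", "field",
    "literarygenre", "location", "magazine", "metrics", "misc",
    "musicalartist", "musicalinstrument", "musicgenre", "organisation",
    "person", "poem", "politicalparty", "politician", "product",
    "programlang", "protein", "researcher", "scientist", "song", "task",
    "theory", "university", "writer",
]

# Rebuild the id -> label list: 0 is "O", then a B-/I- pair per type.
ID2LABEL = ["O"]
for entity_type in ENTITY_TYPES:
    ID2LABEL += [f"B-{entity_type}", f"I-{entity_type}"]


def decode_tags(ner_tags):
    """Map integer tag ids to their label names."""
    return [ID2LABEL[tag] for tag in ner_tags]


def extract_spans(tokens, ner_tags):
    """Group tokens into (entity_type, text) spans using the BIO labels."""
    if len(tokens) != len(ner_tags):
        raise ValueError("tokens and ner_tags must have the same length")
    spans = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, decode_tags(ner_tags)):
        prefix, _, entity_type = label.partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the span that is currently open.
            current_tokens.append(token)
            continue
        if current_type is not None:
            # Any other label closes the open span.
            spans.append((current_type, " ".join(current_tokens)))
        if prefix == "O":
            current_type, current_tokens = None, []
        else:
            # B- opens a new span; a stray I- is treated the same way.
            current_type, current_tokens = entity_type, [token]
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# The conll2003 'train' example from this card.
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
ner_tags = [49, 0, 41, 0, 0, 0, 41, 0, 0]
print(extract_spans(tokens, ner_tags))
# [('organisation', 'EU'), ('misc', 'German'), ('misc', 'British')]
```

Because all subsets share one label list, the same helper works for every subset, even though each domain only uses part of the 79 labels.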

Data Splits

Subset	Train	Dev	Test
conll2003	14,987	3,466	3,684
politics	200	541	651
science	200	450	543
music	100	380	456
literature	100	400	416
ai	100	350	431
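
The split sizes show that the five domain subsets have far fewer training examples than dev or test examples, unlike conll2003. A short sketch of that arithmetic, using only the numbers from the table above; SPLIT_SIZES, total_examples and train_share are illustrative names.

```python
# Split sizes copied from the Data Splits table in this card.
SPLIT_SIZES = {
    "conll2003": {"train": 14987, "dev": 3466, "test": 3684},
    "politics": {"train": 200, "dev": 541, "test": 651},
    "science": {"train": 200, "dev": 450, "test": 543},
    "music": {"train": 100, "dev": 380, "test": 456},
    "literature": {"train": 100, "dev": 400, "test": 416},
    "ai": {"train": 100, "dev": 350, "test": 431},
}


def total_examples(subset):
    """Total number of examples across train, dev and test."""
    return sum(SPLIT_SIZES[subset].values())


def train_share(subset):
    """Fraction of a subset's examples that fall in the train split."""
    return SPLIT_SIZES[subset]["train"] / total_examples(subset)


for subset in SPLIT_SIZES:
    print(f"{subset:<12}{total_examples(subset):>7}{train_share(subset):>8.1%}")
```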

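Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information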

@article{liu2020crossner,
      title={CrossNER: Evaluating Cross-Domain Named Entity Recognition}, 
      author={Zihan Liu and Yan Xu and Tiezheng Yu and Wenliang Dai and Ziwei Ji and Samuel Cahyawijaya and Andrea Madotto and Pascale Fung},
      year={2020},
      eprint={2012.04373},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Thanks to @phucdev for adding this dataset.

Author:

DFKI-SLT

Dataset size:

37.96 KB
