Dataset:

tomaarsen/conllpp

Task:

Token classification

Sub-task:

named-entity-recognition

Language:

Multilinguality:

monolingual

Size:

10K<n<100K

Language creators:

found

Annotation creators:

expert-generated

Source datasets:

extended|conll2003

License:

license:unknown

Dataset card and file list

English

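Dataset Card for "conllpp"

Dataset Summary

CoNLLpp is a corrected version of the CoNLL2003 NER dataset in which the labels of 5.38% of the sentences in the test set have been manually corrected. For completeness, the dataset also includes the training and development sets of CoNLL2003. For example, one correction made to the test set is shown below: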

{
    "tokens": ["SOCCER", "-", "JAPAN", "GET", "LUCKY", "WIN", ",", "CHINA", "IN", "SURPRISE", "DEFEAT", "."],
    "original_ner_tags_in_conll2003": ["O", "O", "B-LOC", "O", "O", "O", "O", "B-PER", "O", "O", "O", "O"],
    "corrected_ner_tags_in_conllpp": ["O", "O", "B-LOC", "O", "O", "O", "O", "B-LOC", "O", "O", "O", "O"]
}
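To make the correction above easier to spot, the two tag sequences can be compared position by position. The following is a minimal sketch; the helper name `changed_positions` is hypothetical and is not part of the dataset or of any library.

```python
def changed_positions(tokens, original, corrected):
    """Return (index, token, old_tag, new_tag) for every position whose tag differs."""
    if not (len(tokens) == len(original) == len(corrected)):
        raise ValueError("tokens and both tag sequences must have the same length")
    return [
        (i, tok, old, new)
        for i, (tok, old, new) in enumerate(zip(tokens, original, corrected))
        if old != new
    ]


# Values copied from the correction example in this card.
tokens = ["SOCCER", "-", "JAPAN", "GET", "LUCKY", "WIN", ",", "CHINA", "IN", "SURPRISE", "DEFEAT", "."]
original = ["O", "O", "B-LOC", "O", "O", "O", "O", "B-PER", "O", "O", "O", "O"]
corrected = ["O", "O", "B-LOC", "O", "O", "O", "O", "B-LOC", "O", "O", "O", "O"]

diffs = changed_positions(tokens, original, corrected)
print(diffs)  # [(7, 'CHINA', 'B-PER', 'B-LOC')]
```

Only the tag for "CHINA" changes, from B-PER to B-LOC.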

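Supported Tasks and Leaderboards

[More Information Needed]

Languages

[More Information Needed]

Dataset Structure

Data Instances

conllpp

Size of downloaded dataset files: 4.85 MB
Size of the generated dataset: 10.26 MB
Total amount of disk used: 15.11 MB

An example of 'train' looks as follows.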

This example was too long and was cropped:

{
    "id": "0",
    "document_id": 1,
    "sentence_id": 3,
    "tokens": ["The", "European", "Commission", "said", "on", "Thursday", "it", "disagreed", "with", "German", "advice", "to", "consumers", "to", "shun", "British", "lamb", "until", "scientists", "determine", "whether", "mad", "cow", "disease", "can", "be", "transmitted", "to", "sheep", "."],
    "pos_tags": [12, 22, 22, 38, 15, 22, 28, 38, 15, 16, 21, 35, 24, 35, 37, 16, 21, 15, 24, 41, 15, 16, 21, 21, 20, 37, 40, 35, 21, 7],
    "ner_tags": [0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "chunk_tags": [11, 12, 12, 21, 13, 11, 11, 21, 13, 11, 12, 13, 11, 21, 22, 11, 12, 17, 11, 21, 17, 11, 12, 12, 21, 22, 22, 13, 11, 0]
}
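Each tag list in a record is parallel to `tokens`, so the tokens that belong to an entity can be picked out by pairing the two lists. The sketch below uses only the first ten tokens of the example above and assumes that id 0 is the "outside" tag; the helper name `tagged_tokens` is hypothetical.

```python
def tagged_tokens(tokens, tag_ids, outside_id=0):
    """Pair each token with its tag id and keep only tokens tagged as part of an entity."""
    return [(tok, tag) for tok, tag in zip(tokens, tag_ids) if tag != outside_id]


# First ten tokens and ner_tags of the 'train' example in this card.
tokens = ["The", "European", "Commission", "said", "on", "Thursday", "it", "disagreed", "with", "German"]
ner_tags = [0, 3, 4, 0, 0, 0, 0, 0, 0, 7]

entity_tokens = tagged_tokens(tokens, ner_tags)
print(entity_tokens)  # [('European', 3), ('Commission', 4), ('German', 7)]
```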

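Data Fields

The data fields are the same among all splits.

conllpp

id: a string feature.
document_id: an int32 feature tracking which document the sample comes from.
sentence_id: an int32 feature tracking which sentence of that document the sample is.
tokens: a list of string features.
pos_tags: a list of classification labels, with possible values including " (0), '' (1), # (2), $ (3), ( (4).
chunk_tags: a list of classification labels, with possible values including O (0), B-ADJP (1), I-ADJP (2), B-ADVP (3), I-ADVP (4).
ner_tags: a list of classification labels, with possible values including O (0), B-PER (1), I-PER (2), B-ORG (3), I-ORG (4).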
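As an illustration of how the integer `ner_tags` might be turned back into entity spans, here is a minimal sketch. It assumes the full label list follows the standard CoNLL-2003 ordering; this card lists only ids 0 to 4, so ids 5 to 8 below are an assumption. `NER_LABELS` and `extract_entities` are hypothetical names, not part of the dataset.

```python
# Assumed label order; only ids 0-4 are listed in this card.
NER_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def extract_entities(tokens, tag_ids, labels=NER_LABELS):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        prefix, _, entity_type = labels[tag_id].partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the span that is currently open.
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            # A stray I- tag with no matching open span starts a new span.
            current_type, current_tokens = entity_type, [token]
        else:
            current_type, current_tokens = None, []
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# First ten tokens and ner_tags of the 'train' example in this card.
tokens = ["The", "European", "Commission", "said", "on", "Thursday", "it", "disagreed", "with", "German"]
ner_tags = [0, 3, 4, 0, 0, 0, 0, 0, 0, 7]

entities = extract_entities(tokens, ner_tags)
print(entities)  # [('ORG', 'European Commission'), ('MISC', 'German')]
```

When the dataset is loaded through a library that exposes the label names, those names should be preferred over a hard-coded list such as the one above.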

Data Splits

name	train	validation	test
conll2003	14041	3250	3453
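For a rough sense of scale, the split sizes can be combined with the 5.38% figure from the summary. This is a back-of-the-envelope sketch that assumes the percentage refers to the 3453 test sentences listed above; the resulting count is an estimate, not a number stated in this card.

```python
# Split sizes copied from the table in this card.
split_sizes = {"train": 14041, "validation": 3250, "test": 3453}
corrected_fraction = 0.0538  # 5.38% of test sentences, per the dataset summary

total = sum(split_sizes.values())
approx_corrected = round(split_sizes["test"] * corrected_fraction)

print(total)             # 20744
print(approx_corrected)  # 186
```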

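Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information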

@inproceedings{wang2019crossweigh,
  title={CrossWeigh: Training Named Entity Tagger from Imperfect Annotations},
  author={Wang, Zihan and Shang, Jingbo and Liu, Liyuan and Lu, Lihao and Liu, Jiacheng and Han, Jiawei},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
  pages={5157--5166},
  year={2019}
}

Contributions

Thanks to @ZihanWangKi for adding this dataset.

Author:

tomaarsen

Dataset size:

22.29 KB