Dataset:

Babelscape/multinerd

English

Dataset Card for the MultiNERD Dataset

Description

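  • Summary: MultiNERD is the first language-agnostic methodology for automatically creating multilingual, multi-genre, and fine-grained annotations for Named Entity Recognition (NER) and Entity Disambiguation. Specifically, it can be seen as an extension of two earlier works from our research group: WikiNEuRal, which inspired the state-of-the-art silver-data creation methodology, and NER4EL, from which we took the fine-grained classes and the inspiration for the entity linking part. The resulting dataset covers 10 languages (Chinese, Dutch, English, French, German, Italian, Polish, Portuguese, Russian, and Spanish), 15 NER categories (Person (PER), Location (LOC), Organization (ORG), Animal (ANIM), Biological entity (BIO), Celestial body (CEL), Disease (DIS), Event (EVE), Food (FOOD), Instrument (INST), Media (MEDIA), Plant (PLANT), Mythological entity (MYTH), Time (TIME), and Vehicle (VEHI)), and 2 textual genres (Wikipedia and WikiNews).
  • Repository: https://github.com/Babelscape/multinerd
  • Paper: https://aclanthology.org/multinerd
  • Contact: tedeschi@babelscape.com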

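Dataset Structure

The data fields are the same across all splits.

  • tokens: a list of string features.
  • ner_tags: a list of classification labels (int).
  • lang: a string feature. Full list of languages: Chinese (zh), Dutch (nl), English (en), French (fr), German (de), Italian (it), Polish (pl), Portuguese (pt), Russian (ru), Spanish (es).
  • The full tag set with its indices is reported below: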
{
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    "B-ORG": 3,
    "I-ORG": 4,
    "B-LOC": 5,
    "I-LOC": 6,
    "B-ANIM": 7,
    "I-ANIM": 8,
    "B-BIO": 9,
    "I-BIO": 10,
    "B-CEL": 11,
    "I-CEL": 12,
    "B-DIS": 13,
    "I-DIS": 14,
    "B-EVE": 15,
    "I-EVE": 16,
    "B-FOOD": 17,
    "I-FOOD": 18,
    "B-INST": 19,
    "I-INST": 20,
    "B-MEDIA": 21,
    "I-MEDIA": 22,
    "B-MYTH": 23,
    "I-MYTH": 24,
    "B-PLANT": 25,
    "I-PLANT": 26,
    "B-TIME": 27,
    "I-TIME": 28,
    "B-VEHI": 29,
    "I-VEHI": 30
}
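
As a rough illustration of how the fields above fit together, the following Python sketch rebuilds the tag-to-index mapping and groups BIO-tagged tokens into entity spans. The example record, the helper name `decode_entities`, and the lenient handling of a stray `I-` tag are assumptions made for illustration; they are not taken from the dataset or its official tooling.

```python
# Entity types in the order implied by the tag set above:
# each type gets a B- tag at an odd index and an I- tag at the next even index.
ENTITY_TYPES = ["PER", "ORG", "LOC", "ANIM", "BIO", "CEL", "DIS", "EVE",
                "FOOD", "INST", "MEDIA", "MYTH", "PLANT", "TIME", "VEHI"]

LABEL2ID = {"O": 0}
for i, etype in enumerate(ENTITY_TYPES):
    LABEL2ID[f"B-{etype}"] = 2 * i + 1
    LABEL2ID[f"I-{etype}"] = 2 * i + 2
ID2LABEL = {idx: label for label, idx in LABEL2ID.items()}


def decode_entities(tokens, ner_tags):
    """Group BIO-tagged tokens into (entity_type, text) spans.

    Hypothetical helper: tokens are joined with spaces, which is only a
    reasonable rendering for whitespace-delimited languages (not Chinese).
    """
    spans = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, ner_tags):
        prefix, _, etype = ID2LABEL[tag_id].partition("-")
        if prefix == "I" and etype == current_type:
            # Continuation of the span that is currently open.
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        if prefix == "O":
            current_type, current_tokens = None, []
        else:
            # "B-X" opens a span; a stray "I-X" is treated as an opening tag too.
            current_type, current_tokens = etype, [token]
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Made-up record that follows the documented schema (not real dataset content).
record = {
    "tokens": ["Marie", "Curie", "was", "born", "in", "Warsaw", "."],
    "ner_tags": [1, 2, 0, 0, 0, 5, 0],
    "lang": "en",
}

print(decode_entities(record["tokens"], record["ner_tags"]))
# [('PER', 'Marie Curie'), ('LOC', 'Warsaw')]
```

Because every record carries a `lang` field, the same helper can be applied after selecting a single language, for example by keeping only records whose `lang` equals "en".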

Additional Information

@inproceedings{tedeschi-navigli-2022-multinerd,
    title = "{M}ulti{NERD}: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation)",
    author = "Tedeschi, Simone  and
      Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-naacl.60",
    doi = "10.18653/v1/2022.findings-naacl.60",
    pages = "801--812",
    abstract = "Named Entity Recognition (NER) is the task of identifying named entities in texts and classifying them through specific semantic categories, a process which is crucial for a wide range of NLP applications. Current datasets for NER focus mainly on coarse-grained entity types, tend to consider a single textual genre and to cover a narrow set of languages, thus limiting the general applicability of NER systems. In this work, we design a new methodology for automatically producing NER annotations, and address the aforementioned limitations by introducing a novel dataset that covers 10 languages, 15 NER categories and 2 textual genres. We also introduce a manually-annotated test set, and extensively evaluate the quality of our novel dataset on both this new test set and standard benchmarks for NER. In addition, in our dataset, we include: i) disambiguation information to enable the development of multilingual entity linking systems, and ii) image URLs to encourage the creation of multimodal systems. We release our dataset at https://github.com/Babelscape/multinerd.",
}
  • Thanks to @sted97 for adding this dataset.