Babelscape/multinerd

Dataset:

Babelscape/multinerd

Task:

Token classification

Subtask:

named-entity-recognition

Languages:

zh, nl, en, fr, de, it, pl, pt, ru, es

Multilinguality:

multilingual

Language creators:

machine-generated

Annotation creators:

machine-generated

Source datasets:

original

Other:

structure-prediction

License:

cc-by-nc-sa-4.0

Dataset Card for MultiNERD

Description

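Summary: MultiNERD is the first language-agnostic methodology for automatically creating multilingual, multi-genre and fine-grained annotations for Named Entity Recognition (NER) and Entity Disambiguation. Specifically, it can be seen as an extension of the combination of two prior works from our research group: WikiNEuRal, from which we took inspiration for the state-of-the-art silver-data creation methodology, and NER4EL, from which we took the fine-grained classes and inspiration for the entity linking part. The resulting dataset covers 10 languages (Chinese, Dutch, English, French, German, Italian, Polish, Portuguese, Russian and Spanish), 15 NER categories (Person (PER), Location (LOC), Organization (ORG), Animal (ANIM), Biological entity (BIO), Celestial body (CEL), Disease (DIS), Event (EVE), Food (FOOD), Instrument (INST), Media (MEDIA), Plant (PLANT), Mythological entity (MYTH), Time (TIME) and Vehicle (VEHI)) and 2 textual genres (Wikipedia and WikiNews).
Repository: https://github.com/Babelscape/multinerd
Paper: https://aclanthology.org/multinerd
Contact: tedeschi@babelscape.com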

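Dataset Structure

The data fields are the same across all splits.

tokens: a list of string features.
ner_tags: a list of classification labels (int).
lang: a string feature. Full list of languages: Chinese (zh), Dutch (nl), English (en), French (fr), German (de), Italian (it), Polish (pl), Portuguese (pt), Russian (ru), Spanish (es).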
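To make the three fields concrete, the following is a minimal sketch of how a record could be checked and filtered by language in plain Python. The sample records are invented for illustration and are not taken from the dataset; the helper names `is_valid_record` and `filter_by_lang` are hypothetical, and the one-tag-per-token check is an assumption based on the usual token-classification layout.

```python
from typing import Dict, List

# Language codes listed for the `lang` field.
LANGS = {"zh", "nl", "en", "fr", "de", "it", "pl", "pt", "ru", "es"}


def is_valid_record(record: Dict) -> bool:
    """Check that a record has the fields tokens, ner_tags and lang.

    Assumption: tokens and ner_tags are aligned (one tag per token)
    and tag ids fall in the 0..30 range of the dataset's tagset.
    """
    tokens = record.get("tokens")
    tags = record.get("ner_tags")
    return (
        isinstance(tokens, list)
        and isinstance(tags, list)
        and len(tokens) == len(tags)
        and all(isinstance(tok, str) for tok in tokens)
        and all(isinstance(tag, int) and 0 <= tag <= 30 for tag in tags)
        and record.get("lang") in LANGS
    )


def filter_by_lang(records: List[Dict], lang: str) -> List[Dict]:
    """Keep only the records whose `lang` field equals the given code."""
    return [r for r in records if r["lang"] == lang]


# Invented examples, NOT real dataset rows (5 is the id of B-LOC).
SAMPLE_RECORDS = [
    {"tokens": ["Rome", "is", "in", "Italy", "."],
     "ner_tags": [5, 0, 0, 5, 0], "lang": "en"},
    {"tokens": ["Roma", "è", "in", "Italia", "."],
     "ner_tags": [5, 0, 0, 5, 0], "lang": "it"},
]

italian_only = filter_by_lang(SAMPLE_RECORDS, "it")
```

With the Hugging Face `datasets` library, a comparable filter could presumably be applied with `Dataset.filter(lambda r: r["lang"] == "it")` after loading the dataset; that step is left out here so the sketch runs without a download.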
The full tagset with indices is reported below:

{
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    "B-ORG": 3,
    "I-ORG": 4,
    "B-LOC": 5,
    "I-LOC": 6,
    "B-ANIM": 7,
    "I-ANIM": 8,
    "B-BIO": 9,
    "I-BIO": 10,
    "B-CEL": 11,
    "I-CEL": 12,
    "B-DIS": 13,
    "I-DIS": 14,
    "B-EVE": 15,
    "I-EVE": 16,
    "B-FOOD": 17,
    "I-FOOD": 18,
    "B-INST": 19,
    "I-INST": 20,
    "B-MEDIA": 21,
    "I-MEDIA": 22,
    "B-MYTH": 23,
    "I-MYTH": 24,
    "B-PLANT": 25,
    "I-PLANT": 26,
    "B-TIME": 27,
    "I-TIME": 28,
    "B-VEHI": 29,
    "I-VEHI": 30
}
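The tagset follows a regular pattern: "O" is 0, and each of the 15 categories contributes a B-/I- pair, giving 31 labels in total. The sketch below rebuilds the mapping from that pattern and uses it to group integer tags back into entity spans. The function name `decode_entities` is hypothetical, the example sentence is invented, and the handling of a stray I- tag (it is dropped rather than started as a new span) is one possible convention, not something the dataset card specifies.

```python
from typing import Dict, List, Tuple

# Category order as it appears in the tagset above.
ENTITY_TYPES = ["PER", "ORG", "LOC", "ANIM", "BIO", "CEL", "DIS", "EVE",
                "FOOD", "INST", "MEDIA", "MYTH", "PLANT", "TIME", "VEHI"]

# "O" is 0; category i gets B- at 2*i + 1 and I- at 2*i + 2.
LABEL2ID: Dict[str, int] = {"O": 0}
for i, etype in enumerate(ENTITY_TYPES):
    LABEL2ID[f"B-{etype}"] = 2 * i + 1
    LABEL2ID[f"I-{etype}"] = 2 * i + 2

ID2LABEL: Dict[int, str] = {idx: label for label, idx in LABEL2ID.items()}


def decode_entities(tokens: List[str],
                    ner_tags: List[int]) -> List[Tuple[str, List[str]]]:
    """Group BIO-tagged tokens into (entity_type, tokens) spans."""
    entities: List[Tuple[str, List[str]]] = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, ner_tags):
        label = ID2LABEL[tag_id]
        if label.startswith("B-"):
            # A B- tag always opens a new span; close any open one first.
            if current_tokens:
                entities.append((current_type, current_tokens))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            # I- tag continuing the open span of the same type.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span:
            # close the open span (if any) and drop the current token.
            if current_tokens:
                entities.append((current_type, current_tokens))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, current_tokens))
    return entities


# Invented example sentence, not a real dataset row.
spans = decode_entities(["Barack", "Obama", "visited", "Rome", "."],
                        [1, 2, 0, 5, 0])
```

Spans are returned as token lists rather than joined strings, since joining with spaces would not suit every language in the dataset (Chinese, for example).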

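Additional Information

Licensing information: The contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.
Citation information: If you use data and/or code from this repository, please consider citing our work: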

@inproceedings{tedeschi-navigli-2022-multinerd,
    title = "{M}ulti{NERD}: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation)",
    author = "Tedeschi, Simone  and
      Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-naacl.60",
    doi = "10.18653/v1/2022.findings-naacl.60",
    pages = "801--812",
    abstract = "Named Entity Recognition (NER) is the task of identifying named entities in texts and classifying them through specific semantic categories, a process which is crucial for a wide range of NLP applications. Current datasets for NER focus mainly on coarse-grained entity types, tend to consider a single textual genre and to cover a narrow set of languages, thus limiting the general applicability of NER systems. In this work, we design a new methodology for automatically producing NER annotations, and address the aforementioned limitations by introducing a novel dataset that covers 10 languages, 15 NER categories and 2 textual genres. We also introduce a manually-annotated test set, and extensively evaluate the quality of our novel dataset on both this new test set and standard benchmarks for NER. In addition, in our dataset, we include: i) disambiguation information to enable the development of multilingual entity linking systems, and ii) image URLs to encourage the creation of multimodal systems. We release our dataset at https://github.com/Babelscape/multinerd.",
}

Thanks to @sted97 for adding this dataset.

Author:

Babelscape

Dataset size:

517.07 MB