Dataset:

Babelscape/wikineural

English

Dataset Card for WikiNEuRal

Description

  • Summary: WikiNEuRal introduces a novel technique for producing high-quality annotations for multilingual NER, combining a multilingual lexicalized knowledge base (i.e., BabelNet) with a Transformer-based architecture (i.e., BERT). On common NER benchmarks it shows consistent improvements of up to 6 span-based F1-score points over state-of-the-art alternative data-production methods. This methodology was used to automatically generate NER training data for 9 languages.
  • 仓库: https://github.com/Babelscape/wikineural
  • Paper: https://aclanthology.org/2021.findings-emnlp.215
  • Point of Contact: tedeschi@babelscape.com

Dataset Structure

The data fields are the same for all splits.

  • tokens: a list of string features.
  • ner_tags: a list of classification labels (int). Full tag set with indices:
{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
  • lang: a string feature. Full list of supported languages: Dutch (nl), English (en), French (fr), German (de), Italian (it), Polish (pl), Portuguese (pt), Russian (ru), Spanish (es).
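The integer tags above follow the BIO scheme (B- opens an entity, I- continues it, O is outside any entity). A minimal sketch in plain Python, decoding a tagged sentence back into entity spans; the example sentence is hypothetical and not taken from the dataset:

```python
# Label set copied from the dataset card (BIO scheme).
ID2LABEL = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG",
            5: "B-LOC", 6: "I-LOC", 7: "B-MISC", 8: "I-MISC"}

def decode_entities(tokens, ner_tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    entities, current, etype = [], [], None
    for tok, tag_id in zip(tokens, ner_tags):
        tag = ID2LABEL[tag_id]
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == etype:
            # Continuation of the currently open entity.
            current.append(tok)
        else:
            # O tag (or an inconsistent I- tag) closes the open entity.
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities

# Hypothetical example, for illustration only.
tokens = ["Simone", "Tedeschi", "works", "at", "Babelscape", "in", "Rome", "."]
tags   = [1, 2, 0, 0, 3, 0, 5, 0]
print(decode_entities(tokens, tags))
# → [('Simone Tedeschi', 'PER'), ('Babelscape', 'ORG'), ('Rome', 'LOC')]
```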

Dataset Statistics

The table below shows the number of sentences, the number of tokens, and the number of instances per class for each language.

Dataset     Version  Sentences  Tokens  PER  ORG  LOC   MISC  OTHER
WikiNEuRal  EN       116k       2.73M   51k  31k  67k   45k   2.40M
WikiNEuRal  ES       95k        2.33M   43k  17k  68k   25k   2.04M
WikiNEuRal  NL       107k       1.91M   46k  22k  61k   24k   1.64M
WikiNEuRal  DE       124k       2.19M   60k  32k  59k   25k   1.87M
WikiNEuRal  RU       123k       2.39M   40k  26k  89k   25k   2.13M
WikiNEuRal  IT       111k       2.99M   67k  22k  97k   26k   2.62M
WikiNEuRal  FR       127k       3.24M   76k  25k  101k  29k   2.83M
WikiNEuRal  PL       141k       2.29M   59k  34k  118k  22k   1.91M
WikiNEuRal  PT       106k       2.53M   44k  17k  112k  25k   2.20M
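Per-class instance counts like those in the table can be derived from the ner_tags field by counting B- tags (each B- tag opens exactly one entity instance) and tallying O tokens under OTHER. A self-contained sketch over toy input; the sample sentences are hypothetical:

```python
from collections import Counter

# Label set copied from the dataset card (BIO scheme).
ID2LABEL = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG",
            5: "B-LOC", 6: "I-LOC", 7: "B-MISC", 8: "I-MISC"}

def class_counts(sentences):
    """Count entity instances per class from a list of ner_tags sequences.

    Each B- tag starts one instance; O tokens are tallied as OTHER,
    mirroring the columns of the statistics table.
    """
    counts = Counter()
    for ner_tags in sentences:
        for tag_id in ner_tags:
            label = ID2LABEL[tag_id]
            if label.startswith("B-"):
                counts[label[2:]] += 1
            elif label == "O":
                counts["OTHER"] += 1
    return counts

# Two hypothetical tagged sentences, for illustration only.
sample = [[1, 2, 0, 5], [3, 0, 0]]
print(class_counts(sample))
# → Counter({'OTHER': 3, 'PER': 1, 'LOC': 1, 'ORG': 1})
```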

Additional Information

@inproceedings{tedeschi-etal-2021-wikineural-combined,
    title = "{W}iki{NE}u{R}al: {C}ombined Neural and Knowledge-based Silver Data Creation for Multilingual {NER}",
    author = "Tedeschi, Simone  and
      Maiorca, Valentino  and
      Campolungo, Niccol{\`o}  and
      Cecconi, Francesco  and
      Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.215",
    pages = "2521--2533",
    abstract = "Multilingual Named Entity Recognition (NER) is a key intermediate task which is needed in many areas of NLP. In this paper, we address the well-known issue of data scarcity in NER, especially relevant when moving to a multilingual scenario, and go beyond current approaches to the creation of multilingual silver data for the task. We exploit the texts of Wikipedia and introduce a new methodology based on the effective combination of knowledge-based approaches and neural models, together with a novel domain adaptation technique, to produce high-quality training corpora for NER. We evaluate our datasets extensively on standard benchmarks for NER, yielding substantial improvements up to 6 span-based F1-score points over previous state-of-the-art systems for data creation.",
}
  • Contributions: Thanks to @sted97 for adding this dataset.