
Dataset Card for DaNE

Dataset Summary

DaNE is a named entity annotation of the Danish Universal Dependencies treebank (UD-DDT) using the CoNLL-2003 annotation scheme.

The Danish UD treebank (Johannsen et al., 2015; UD-DDT) is a conversion of the Danish Dependency Treebank (Buch-Kromann et al., 2003), which is based on texts from the PAROLE corpus (Britt, 1998). UD-DDT carries annotations for dependency parsing and part-of-speech tagging. In the DaNE dataset (Hvingelby et al., 2020), the treebank was additionally annotated by the Alexandra Institute with named entities of the types PER, ORG and LOC.

Supported Tasks and Leaderboards

Part-of-speech tagging, dependency parsing and named entity recognition.

Languages

Danish

Dataset Structure

Data Instances

This is an example from the "train" split:

{
  'sent_id': 'train-v2-0\n', 
  'lemmas': ['på', 'fredag', 'have', 'SiD', 'invitere', 'til', 'reception', 'i', 'SID-hus', 'i', 'anledning', 'af', 'at', 'formand', 'Kjeld', 'Christensen', 'gå', 'ind', 'i', 'den', 'glad', 'tresser', '.'],
  'dep_labels': [35, 16, 28, 33, 19, 35, 16, 35, 18, 35, 18, 1, 1, 33, 22, 12, 32, 11, 35, 10, 30, 16, 34],
  'ner_tags': [0, 0, 0, 3, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0],
  'morph_tags': ['AdpType=Prep', 'Definite=Ind|Gender=Com|Number=Sing', 'Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act', '_', 'Definite=Ind|Number=Sing|Tense=Past|VerbForm=Part', 'AdpType=Prep', 'Definite=Ind|Gender=Com|Number=Sing', 'AdpType=Prep', 'Definite=Def|Gender=Neut|Number=Sing', 'AdpType=Prep', 'Definite=Ind|Gender=Com|Number=Sing', 'AdpType=Prep', '_', 'Definite=Def|Gender=Com|Number=Sing', '_', '_', 'Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act', '_', 'AdpType=Prep', 'Number=Plur|PronType=Dem', 'Degree=Pos|Number=Plur', 'Definite=Ind|Gender=Com|Number=Plur', '_'], 
  'dep_ids': [2, 5, 5, 5, 0, 7, 5, 9, 7, 11, 7, 17, 17, 17, 14, 15, 11, 17, 22, 22, 22, 18, 5], 
  'pos_tags': [11, 12, 5, 7, 3, 11, 12, 11, 12, 11, 12, 11, 16, 12, 7, 7, 3, 9, 11, 14, 6, 12, 10], 
  'text': 'På fredag har SID inviteret til reception i SID-huset i anledning af at formanden Kjeld Christensen går ind i de glade tressere.\n', 
  'tokens': ['På', 'fredag', 'har', 'SID', 'inviteret', 'til', 'reception', 'i', 'SID-huset', 'i', 'anledning', 'af', 'at', 'formanden', 'Kjeld', 'Christensen', 'går', 'ind', 'i', 'de', 'glade', 'tressere', '.'], 
  'tok_ids': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
}
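
An instance like the one above can be reproduced with the Hugging Face datasets library. The snippet below is a minimal sketch; it assumes the dataset is available on the Hub under the id "dane", which may differ depending on the hosting organization:

from datasets import load_dataset

# Load DaNE from the Hugging Face Hub (the dataset id "dane" is an assumption).
dane = load_dataset("dane")

# The first training example has the structure shown above.
example = dane["train"][0]
print(example["text"])
print(example["tokens"])
print(example["ner_tags"])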

Data Fields

  • sent_id: a string identifier for each example
  • text: a string, the original (untokenized) sentence
  • tok_ids: a list of ids (ints), one per token
  • tokens: a list of strings, the tokens
  • lemmas: a list of strings, the lemmas of the tokens
  • pos_tags: a list of strings, the part-of-speech tags of the tokens
  • morph_tags: a list of strings, the morphological tags of the tokens
  • dep_ids: a list of ids (ints), the id of the head of the incoming dependency for each token
  • dep_labels: a list of strings, the dependency labels
  • ner_tags: a list of strings, the named entity tags in BIO format (decoding the integer ids shown in the example above is sketched after this list)
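
In the stored dataset, pos_tags, dep_labels and ner_tags are ClassLabel features, so examples expose them as integer ids (as in the instance above) while the string names are kept in the dataset's features. A minimal decoding sketch, again assuming the hypothetical "dane" dataset id used earlier:

from datasets import load_dataset

# Dataset id "dane" is an assumption; adjust it if hosted under another namespace.
dane = load_dataset("dane")
train = dane["train"]

# The Sequence(ClassLabel) feature keeps the integer-to-string mapping.
ner_feature = train.features["ner_tags"].feature

example = train[0]
# Decode the integer ids back to BIO tags such as "B-PER" or "O".
ner_strings = [ner_feature.int2str(i) for i in example["ner_tags"]]
print(list(zip(example["tokens"], ner_strings)))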

Data Splits

              train    validation      test
# sentences   4,383           564       565
# tokens     80,378        10,322    10,023
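
The split sizes above can be verified directly from the loaded dataset; a small sketch, again assuming the "dane" id used earlier:

from datasets import load_dataset

dane = load_dataset("dane")  # dataset id assumed, as above

# Count sentences (examples) and tokens in each split.
for split_name, split in dane.items():
    n_sentences = len(split)
    n_tokens = sum(len(tokens) for tokens in split["tokens"])
    print(f"{split_name}: {n_sentences} sentences, {n_tokens} tokens")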

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@inproceedings{hvingelby-etal-2020-dane,
    title = "{D}a{NE}: A Named Entity Resource for {D}anish",
    author = "Hvingelby, Rasmus  and
      Pauli, Amalie Brogaard  and
      Barrett, Maria  and
      Rosted, Christina  and
      Lidegaard, Lasse Malm  and
      S{\o}gaard, Anders",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.565",
    pages = "4597--4604",
    abstract = "We present a named entity annotation for the Danish Universal Dependencies treebank using the CoNLL-2003 annotation scheme: DaNE. It is the largest publicly available, Danish named entity gold annotation. We evaluate the quality of our annotations intrinsically by double annotating the entire treebank and extrinsically by comparing our annotations to a recently released named entity annotation of the validation and test sections of the Danish Universal Dependencies treebank. We benchmark the new resource by training and evaluating competitive architectures for supervised named entity recognition (NER), including FLAIR, monolingual (Danish) BERT and multilingual BERT. We explore cross-lingual transfer in multilingual BERT from five related languages in zero-shot and direct transfer setups, and we show that even with our modestly-sized training set, we improve Danish NER over a recent cross-lingual approach, as well as over zero-shot transfer from five related languages. Using multilingual BERT, we achieve higher performance by fine-tuning on both DaNE and a larger Bokm{\aa}l (Norwegian) training set compared to only using DaNE. However, the highest performance is achieved by using a Danish BERT fine-tuned on DaNE. Our dataset enables improvements and applicability for Danish NER beyond cross-lingual methods. We employ a thorough error analysis of the predictions of the best models for seen and unseen entities, as well as their robustness on un-capitalized text. The annotated dataset and all the trained models are made publicly available.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

Contributions

Thanks to @ophelielacroix and @lhoestq for adding this dataset.