数据集:
dane
任务:
标记分类语言:
da计算机处理:
monolingual大小:
1K<n<10K语言创建人:
found批注创建人:
expert-generated许可:
cc-by-sa-4.0DaNE(Danish Dependency Treebank)是使用CoNLL-2003注释方案对丹麦通用依存树库进行命名实体注释。
丹麦UD树库(Johannsen等人,2015年,UD-DDT)是丹麦依存树库(Buch-Kromann等人,2003年)的转换,基于Parole(Britt,1998年)的文本。UD-DDT具有依存解析和词性标注的注释。该数据集由Alexandra研究所在DaNE数据集(Hvingelby等人,2020年)中使用命名实体对PER、ORG和LOC进行了注释。
词性标注、依存解析和命名实体识别。
丹麦语
这是“train”数据集中的一个示例:
{ 'sent_id': 'train-v2-0\n', 'lemmas': ['på', 'fredag', 'have', 'SiD', 'invitere', 'til', 'reception', 'i', 'SID-hus', 'i', 'anledning', 'af', 'at', 'formand', 'Kjeld', 'Christensen', 'gå', 'ind', 'i', 'den', 'glad', 'tresser', '.'], 'dep_labels': [35, 16, 28, 33, 19, 35, 16, 35, 18, 35, 18, 1, 1, 33, 22, 12, 32, 11, 35, 10, 30, 16, 34], 'ner_tags': [0, 0, 0, 3, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0], 'morph_tags': ['AdpType=Prep', 'Definite=Ind|Gender=Com|Number=Sing', 'Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act', '_', 'Definite=Ind|Number=Sing|Tense=Past|VerbForm=Part', 'AdpType=Prep', 'Definite=Ind|Gender=Com|Number=Sing', 'AdpType=Prep', 'Definite=Def|Gender=Neut|Number=Sing', 'AdpType=Prep', 'Definite=Ind|Gender=Com|Number=Sing', 'AdpType=Prep', '_', 'Definite=Def|Gender=Com|Number=Sing', '_', '_', 'Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act', '_', 'AdpType=Prep', 'Number=Plur|PronType=Dem', 'Degree=Pos|Number=Plur', 'Definite=Ind|Gender=Com|Number=Plur', '_'], 'dep_ids': [2, 5, 5, 5, 0, 7, 5, 9, 7, 11, 7, 17, 17, 17, 14, 15, 11, 17, 22, 22, 22, 18, 5], 'pos_tags': [11, 12, 5, 7, 3, 11, 12, 11, 12, 11, 12, 11, 16, 12, 7, 7, 3, 9, 11, 14, 6, 12, 10], 'text': 'På fredag har SID inviteret til reception i SID-huset i anledning af at formanden Kjeld Christensen går ind i de glade tressere.\n', 'tokens': ['På', 'fredag', 'har', 'SID', 'inviteret', 'til', 'reception', 'i', 'SID-huset', 'i', 'anledning', 'af', 'at', 'formanden', 'Kjeld', 'Christensen', 'går', 'ind', 'i', 'de', 'glade', 'tressere', '.'], 'tok_ids': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] }
数据字段:
train | validation | test | |
---|---|---|---|
# sentences | 4383 | 564 | 565 |
# tokens | 80 378 | 10 322 | 10 023 |
[需要更多信息]
[需要更多信息]
谁是源语言的生产者?[需要更多信息]
[需要更多信息]
谁是注释者?[需要更多信息]
[需要更多信息]
[需要更多信息]
[需要更多信息]
[需要更多信息]
[需要更多信息]
[需要更多信息]
[需要更多信息]
@inproceedings{hvingelby-etal-2020-dane, title = "{D}a{NE}: A Named Entity Resource for {D}anish", author = "Hvingelby, Rasmus and Pauli, Amalie Brogaard and Barrett, Maria and Rosted, Christina and Lidegaard, Lasse Malm and S{\o}gaard, Anders", booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference", month = may, year = "2020", address = "Marseille, France", publisher = "European Language Resources Association", url = "https://aclanthology.org/2020.lrec-1.565", pages = "4597--4604", abstract = "We present a named entity annotation for the Danish Universal Dependencies treebank using the CoNLL-2003 annotation scheme: DaNE. It is the largest publicly available, Danish named entity gold annotation. We evaluate the quality of our annotations intrinsically by double annotating the entire treebank and extrinsically by comparing our annotations to a recently released named entity annotation of the validation and test sections of the Danish Universal Dependencies treebank. We benchmark the new resource by training and evaluating competitive architectures for supervised named entity recognition (NER), including FLAIR, monolingual (Danish) BERT and multilingual BERT. We explore cross-lingual transfer in multilingual BERT from five related languages in zero-shot and direct transfer setups, and we show that even with our modestly-sized training set, we improve Danish NER over a recent cross-lingual approach, as well as over zero-shot transfer from five related languages. Using multilingual BERT, we achieve higher performance by fine-tuning on both DaNE and a larger Bokm{\aa}l (Norwegian) training set compared to only using DaNE. However, the highest performance isachieved by using a Danish BERT fine-tuned on DaNE. Our dataset enables improvements and applicability for Danish NER beyond cross-lingual methods. We employ a thorough error analysis of the predictions of the best models for seen and unseen entities, as well as their robustness on un-capitalized text. The annotated dataset and all the trained models are made publicly available.", language = "English", ISBN = "979-10-95546-34-4", }
感谢 @ophelielacroix , @lhoestq 添加此数据集。