Dataset:

chcaa/DANSK

Language:

da

Dataset Summary

DANSK: Danish Annotations for NLP Specific TasKs is a dataset consisting of texts from multiple domains, sampled from the Danish Gigaword Corpus (DAGW). The dataset was created to fill the gap in Danish NLP datasets that span multiple domains, which are required for training models that generalize across domains. The named-entity annotations are moreover fine-grained, following a form similar to that of OntoNotes v5, which significantly broadens the use cases of the dataset. The domains include Web, News, Wiki & Books, Legal, Dannet, Conversation, and Social Media. For a more in-depth description of the domains, please refer to DAGW.

The distribution of texts and named entities within each domain can be seen in the tables in the Descriptive Statistics section below.

Update log

  • 2023-05-26: Added individual annotations for each annotator to allow for analysis of inter-annotator agreement

Supported Tasks

The DANSK dataset currently supports only Named-Entity Recognition (NER), but future releases will contain data for more tasks.

Languages

All texts in the dataset are in Danish. Slang and dialects from various platforms may appear, consistent with the domains from which the texts were originally sampled, e.g. Social Media.

Dataset Structure

Data Instances

The JSON-formatted data is in the form seen below:

{
    "text": "Aborrer over 2 kg er en uhyre sj\u00e6lden fangst.",
    "ents": [{"start": 13, "end": 17, "label": "QUANTITY"}],
    "sents": [{"start": 0, "end": 45}],
    "tokens": [
        {"id": 0, "start": 0, "end": 7},
        {"id": 1, "start": 8, "end": 12},
        {"id": 2, "start": 13, "end": 14},
        {"id": 3, "start": 15, "end": 17},
        {"id": 4, "start": 18, "end": 20},
        {"id": 5, "start": 21, "end": 23},
        {"id": 6, "start": 24, "end": 29},
        {"id": 7, "start": 30, "end": 37},
        {"id": 8, "start": 38, "end": 44},
        {"id": 9, "start": 44, "end": 45}
    ],
    "spans": {"incorrect_spans": []},
    "dagw_source": "wiki",
    "dagw_domain": "Wiki & Books",
    "dagw_source_full": "Wikipedia"
}
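
Such instances can be loaded and inspected with the Hugging Face datasets library; the snippet below is a minimal sketch (the split name "train" is an assumption, check the available splits on your install):

from datasets import load_dataset

# Load DANSK from the Hugging Face Hub (requires `pip install datasets`)
dansk = load_dataset("chcaa/DANSK")

# Print the first training text and the surface form of each annotated entity
example = dansk["train"][0]
print(example["text"])
for ent in example["ents"]:
    print(ent["label"], "->", example["text"][ent["start"]:ent["end"]])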

Data Fields

  • text : The raw text of the instance
  • ents : The annotated named entities, given as character offsets (start, end) plus a label
  • sents : The sentences of the text, given as character offsets
  • tokens : The tokens of the text, given as an id plus character offsets
  • spans : Additional span annotations (e.g. incorrect_spans)
  • dagw_source : Shorthand name of the source from which the text was sampled in the Danish Gigaword Corpus
  • dagw_source_full : Full name of the source from which the text was sampled in the Danish Gigaword Corpus
  • dagw_domain : Name of the domain to which the source belongs
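
Since ents and tokens are both given as character offsets into text, token-level tags can be derived by aligning the two fields. The function below is an illustrative sketch, not part of the dataset itself, that converts one instance into IOB2 tags:

def ents_to_iob2(instance):
    # Start with one "O" tag per token
    tags = ["O"] * len(instance["tokens"])
    for ent in instance["ents"]:
        # Indices of tokens falling inside the entity's character span
        inside = [i for i, tok in enumerate(instance["tokens"])
                  if tok["start"] >= ent["start"] and tok["end"] <= ent["end"]]
        for n, i in enumerate(inside):
            tags[i] = ("B-" if n == 0 else "I-") + ent["label"]
    return tags

# For the instance shown above, this yields:
# ["O", "O", "B-QUANTITY", "I-QUANTITY", "O", "O", "O", "O", "O", "O"]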

Data Splits

The data was randomly split into three distinct partitions: train, dev, and test. The splits were drawn from the same pool, so there are no underlying differences between the sets. For the distribution of named entities and domains across the partitions, please refer to the paper, or see the summary statistics provided in the Dataset Composition section of this card.
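
As a quick sanity check, the split sizes can be compared against the Dataset Composition table below; a small sketch, assuming the splits are exposed as "train", "dev", and "test":

from datasets import load_dataset

dansk = load_dataset("chcaa/DANSK")
for name, split in dansk.items():
    print(name, len(split))  # expected: train 12062, dev 1500, test 1500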

Descriptive Statistics

Dataset Composition

Named entity annotation composition across partitions can be seen in the table below:

| | Full | Train | Validation | Test |
| --- | --- | --- | --- | --- |
| Texts | 15062 | 12062 (80%) | 1500 (10%) | 1500 (10%) |
| Named entities | 14462 | 11638 (80.47%) | 1327 (9.18%) | 1497 (10.25%) |
| CARDINAL | 2069 | 1702 (82.26%) | 168 (8.12%) | 226 (10.92%) |
| DATE | 1756 | 1411 (80.35%) | 182 (10.36%) | 163 (9.28%) |
| EVENT | 211 | 175 (82.94%) | 19 (9.00%) | 17 (8.06%) |
| FACILITY | 246 | 200 (81.30%) | 25 (10.16%) | 21 (8.54%) |
| GPE | 1604 | 1276 (79.55%) | 135 (8.42%) | 193 (12.03%) |
| LANGUAGE | 126 | 53 (42.06%) | 17 (13.49%) | 56 (44.44%) |
| LAW | 183 | 148 (80.87%) | 17 (9.29%) | 18 (9.84%) |
| LOCATION | 424 | 351 (82.78%) | 46 (10.85%) | 27 (6.37%) |
| MONEY | 714 | 566 (79.27%) | 72 (10.08%) | 76 (10.64%) |
| NORP | 495 | 405 (81.82%) | 41 (8.28%) | 49 (9.90%) |
| ORDINAL | 127 | 105 (82.68%) | 11 (8.66%) | 11 (8.66%) |
| ORGANIZATION | 2507 | 1960 (78.18%) | 249 (9.93%) | 298 (11.87%) |
| PERCENT | 148 | 123 (83.11%) | 13 (8.78%) | 12 (8.11%) |
| PERSON | 2133 | 1767 (82.84%) | 191 (8.95%) | 175 (8.20%) |
| PRODUCT | 763 | 634 (83.09%) | 57 (7.47%) | 72 (9.44%) |
| QUANTITY | 292 | 242 (82.88%) | 28 (9.59%) | 22 (7.53%) |
| TIME | 218 | 185 (84.86%) | 18 (8.26%) | 15 (6.88%) |
| WORK OF ART | 419 | 335 (79.95%) | 38 (9.07%) | 46 (10.98%) |
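
The per-label counts above can be recomputed directly from the data; the snippet below is an illustrative sketch (again assuming a "train" split):

from collections import Counter
from datasets import load_dataset

dansk = load_dataset("chcaa/DANSK")

# Count named-entity labels across all training instances
label_counts = Counter(
    ent["label"] for example in dansk["train"] for ent in example["ents"]
)
print(label_counts.most_common())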

Domain distribution

Domain and source distribution across partitions can be seen in the table below:

| Domain | Source | Full | Train | Dev | Test |
| --- | --- | --- | --- | --- | --- |
| Conversation | Europa Parlamentet | 206 | 173 | 17 | 16 |
| Conversation | Folketinget | 23 | 21 | 1 | 1 |
| Conversation | NAAT | 554 | 431 | 50 | 73 |
| Conversation | OpenSubtitles | 377 | 300 | 39 | 38 |
| Conversation | Spontaneous speech | 489 | 395 | 54 | 40 |
| Dannet | Dannet | 25 | 18 | 4 | 3 |
| Legal | Retsinformation.dk | 965 | 747 | 105 | 113 |
| Legal | Skat.dk | 471 | 364 | 53 | 54 |
| Legal | Retspraksis | 727 | 579 | 76 | 72 |
| News | DanAvis | 283 | 236 | 20 | 27 |
| News | TV2R | 138 | 110 | 16 | 12 |
| Social Media | hestenettet.dk | 554 | 439 | 51 | 64 |
| Web | Common Crawl | 8270 | 6661 | 826 | 783 |
| Wiki & Books | adl | 640 | 517 | 57 | 66 |
| Wiki & Books | Wikipedia | 279 | 208 | 30 | 41 |
| Wiki & Books | WikiBooks | 335 | 265 | 36 | 34 |
| Wiki & Books | WikiSource | 455 | 371 | 43 | 41 |

Entity Distribution across Domains

Domain and named entity distributions for the training set can be seen below:

| | All domains combined | Conversation | Dannet | Legal | News | Social Media | Web | Wiki & Books |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DOCS | 12062 | 1320 | 18 | 1690 | 346 | 439 | 6661 | 1361 |
| ENTS | 11638 | 1060 | 15 | 1292 | 419 | 270 | 7502 | 883 |
| CARDINAL | 1702 | 346 | 6 | 95 | 35 | 17 | 1144 | 59 |
| DATE | 1411 | 113 | 5 | 257 | 40 | 29 | 831 | 126 |
| EVENT | 175 | 43 | 0 | 1 | 9 | 3 | 106 | 8 |
| FACILITY | 200 | 2 | 0 | 4 | 18 | 3 | 159 | 10 |
| GPE | 1276 | 130 | 2 | 60 | 68 | 31 | 846 | 128 |
| LANGUAGE | 53 | 3 | 0 | 0 | 0 | 0 | 34 | 16 |
| LAW | 148 | 10 | 0 | 100 | 1 | 0 | 22 | 13 |
| LOCATION | 351 | 18 | 0 | 1 | 7 | 7 | 288 | 29 |
| MONEY | 566 | 1 | 0 | 62 | 13 | 18 | 472 | 0 |
| NORP | 405 | 70 | 0 | 61 | 22 | 1 | 188 | 42 |
| ORDINAL | 105 | 11 | 0 | 17 | 9 | 2 | 43 | 22 |
| ORGANIZATION | 1960 | 87 | 0 | 400 | 61 | 39 | 1303 | 58 |
| PERCENT | 123 | 5 | 0 | 10 | 11 | 0 | 91 | 4 |
| PERSON | 1767 | 189 | 2 | 194 | 101 | 69 | 970 | 121 |
| PRODUCT | 634 | 3 | 0 | 10 | 2 | 33 | 581 | 3 |
| QUANTITY | 242 | 1 | 0 | 9 | 6 | 17 | 188 | 20 |
| TIME | 185 | 16 | 0 | 5 | 13 | 1 | 144 | 6 |
| WORK OF ART | 335 | 12 | 0 | 6 | 3 | 0 | 92 | 218 |

Domain and named entity distributions for the validation set can be seen below:

| | Sum | Conversation | Dannet | Legal | News | Social Media | Web | Wiki & Books |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DOCS | 1500 | 161 | 4 | 234 | 36 | 51 | 826 | 166 |
| ENTS | 1497 | 110 | 4 | 171 | 43 | 30 | 983 | 143 |
| CARDINAL | 226 | 41 | 2 | 19 | 7 | 5 | 139 | 13 |
| DATE | 163 | 11 | 0 | 27 | 6 | 4 | 89 | 26 |
| EVENT | 17 | 2 | 0 | 0 | 1 | 0 | 13 | 1 |
| FACILITY | 21 | 1 | 0 | 0 | 0 | 0 | 16 | 4 |
| GPE | 193 | 17 | 1 | 8 | 7 | 2 | 131 | 25 |
| LANGUAGE | 56 | 0 | 0 | 0 | 0 | 0 | 50 | 6 |
| LAW | 18 | 2 | 0 | 8 | 0 | 0 | 8 | 0 |
| LOCATION | 27 | 2 | 0 | 1 | 0 | 0 | 21 | 3 |
| MONEY | 76 | 2 | 0 | 9 | 1 | 6 | 58 | 0 |
| NORP | 49 | 8 | 0 | 8 | 1 | 2 | 21 | 9 |
| ORDINAL | 11 | 2 | 0 | 2 | 0 | 1 | 3 | 3 |
| ORGANIZATION | 298 | 6 | 0 | 68 | 5 | 3 | 212 | 4 |
| PERCENT | 12 | 0 | 0 | 2 | 0 | 0 | 10 | 0 |
| PERSON | 175 | 16 | 1 | 16 | 11 | 4 | 96 | 20 |
| PRODUCT | 72 | 0 | 0 | 0 | 0 | 2 | 69 | 1 |
| QUANTITY | 22 | 0 | 0 | 1 | 2 | 1 | 17 | 1 |
| TIME | 15 | 0 | 0 | 0 | 2 | 0 | 13 | 0 |
| WORK OF ART | 46 | 0 | 0 | 2 | 0 | 0 | 17 | 27 |

Domain and named entity distributions for the test set can be seen below:

| | Sum | Conversation | Dannet | Legal | News | Social Media | Web | Wiki & Books |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DOCS | 1500 | 161 | 4 | 234 | 36 | 51 | 826 | 166 |
| ENTS | 1497 | 110 | 4 | 171 | 43 | 30 | 983 | 143 |
| CARDINAL | 226 | 41 | 2 | 19 | 7 | 5 | 139 | 13 |
| DATE | 163 | 11 | 0 | 27 | 6 | 4 | 89 | 26 |
| EVENT | 17 | 2 | 0 | 0 | 1 | 0 | 13 | 1 |
| FACILITY | 21 | 1 | 0 | 0 | 0 | 0 | 16 | 4 |
| GPE | 193 | 17 | 1 | 8 | 7 | 2 | 131 | 25 |
| LANGUAGE | 56 | 0 | 0 | 0 | 0 | 0 | 50 | 6 |
| LAW | 18 | 2 | 0 | 8 | 0 | 0 | 8 | 0 |
| LOCATION | 27 | 2 | 0 | 1 | 0 | 0 | 21 | 3 |
| MONEY | 76 | 2 | 0 | 9 | 1 | 6 | 58 | 0 |
| NORP | 49 | 8 | 0 | 8 | 1 | 2 | 21 | 9 |
| ORDINAL | 11 | 2 | 0 | 2 | 0 | 1 | 3 | 3 |
| ORGANIZATION | 298 | 6 | 0 | 68 | 5 | 3 | 212 | 4 |
| PERCENT | 12 | 0 | 0 | 2 | 0 | 0 | 10 | 0 |
| PERSON | 175 | 16 | 1 | 16 | 11 | 4 | 96 | 20 |
| PRODUCT | 72 | 0 | 0 | 0 | 0 | 2 | 69 | 1 |
| QUANTITY | 22 | 0 | 0 | 1 | 2 | 1 | 17 | 1 |
| TIME | 15 | 0 | 0 | 0 | 2 | 0 | 13 | 0 |
| WORK OF ART | 46 | 0 | 0 | 2 | 0 | 0 | 17 | 27 |

Dataset Creation

Curation Rationale

The dataset is meant to fill a gap in Danish NLP, which has so far lacked a dataset with 1) fine-grained named-entity labels and 2) high variance in the domain origin of its texts. As such, DANSK is intended for training by anyone who wishes to create NER models that are both generalizable across domains and fine-grained in their predictions. It may also be used for cross-domain evaluation, in order to uncover any potential domain biases. While the dataset currently contains only named-entity annotations, future versions are intended to feature dependency parsing, POS tagging, and possibly revised NER annotations.

Source Data

The data collection, annotation, and normalization steps were extensive. As the full description is too long for this card, please refer to the associated paper upon its publication.

Initial Data Collection and Normalization

Annotations

Annotation process

To afford high granularity, the DANSK dataset utilized the annotation standard of OntoNotes 5.0. The standard features 18 different named entity types. The full description can be seen in the associated paper.

Who are the annotators?

Ten Master's students from the English Linguistics programme at Aarhus University were employed as annotators. They worked 10 hours/week for six weeks, from October 11, 2021, to November 22, 2021. Their annotation tasks included part-of-speech (POS) tagging, dependency parsing, and NER annotation. Named-entity annotation and dependency parsing were done from scratch, while the POS tagging consisted of correcting silver-standard predictions from an NLP model.

Annotator Compensation

Annotators were compensated at the standard rate for students, as determined by the collective agreement between the Danish Ministry of Finance and the Central Organization of Teachers and the CO10 Central Organization of 2010 (the CO10 joint agreement): 140 DKK/hour.

Automatic correction

During the manual correction of the annotations, a series of consistent errors was found. These were corrected using the following regex patterns (see also the Danish addendum to the OntoNotes annotation guidelines):

Regex Patterns

For matching TIME spans, e.g. [16:30 - 17:30] (TIME):

\d{1,2}:\d\d ?[-|\||\/] ?\d
dag: \d{1,2}

For matching DATE spans, e.g. [1938 - 1992] (DATE):

\d{2,4} ?[-|–] ?\d{2,4}

For matching companies with A/S and ApS, e.g. [Hansens Skomager A/S] (ORGANIZATION):

ApS
A\/S

For matching written numerals, e.g. "to":

to | to$|^to| To | To$|^To| TO | TO$|^TO|
tre | tre$|^tre| Tre | Tre$|^Tre| TRE | TRE$|^TRE|
fire | fire$|^fire| Fire | Fire$|^Fire| FIRE | FIRE$|^FIRE|
fem | fem$|^fem| Fem | Fem$|^Fem| FEM | FEM$|^FEM|
seks | seks$|^seks| Seks | Seks$|^Seks| SEKS | SEKS$|^SEKS|
syv | syv$|^syv| Syv | Syv$|^Syv| SYV | SYV$|^SYV|
otte | otte$|^otte| Otte | Otte$|^Otte| OTTE | OTTE$|^OTTE|
ni | ni$|^ni| Ni | Ni$|^Ni| NI | NI$|^NI|
ti | ti$|^ti| Ti | Ti$|^Ti| TI | TI$|^TI

For matching "Himlen" or "Himmelen" already annotated as LOCATION, e.g. "HIMLEN":

[Hh][iI][mM][lL][Ee][Nn]|[Hh][iI][mM][mM][Ee][lL][Ee][Nn]

For matching "Gud" already tagged as PERSON, e.g. "GUD":

[Gg][Uu][Dd]

For matching telephone numbers already wrongly tagged as CARDINAL, e.g. "20 40 44 30":

\d{2} \d{2} \d{2} \d{2}
\+\d{2} \d{2} ?\d{2} ?\d{2} ?\d{2}$
 \d{4} ?\d{4}$
^\d{4} ?\d{4}$

For matching websites already wrongly tagged as ORGANIZATION:

\.dk$|\.com$

For matching Hotels and Resorts already wrongly tagged as ORGANIZATION:

.*[h|H]otel.*|.*[R|r]esort.*

For matching numbers including /, :, or -, already wrongly tagged as CARDINAL:

\/
:
-

For matching rights already wrongly tagged as LAW:

[C|c]opyright
[®|©]
[f|F]ortrydelsesret
[o|O]phavsret$
[m|M]enneskeret
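
As an illustration of how such patterns can be applied (a hedged sketch, not the authors' actual correction script), the telephone-number pattern above could be used to drop wrongly tagged CARDINAL entities like so:

import re

# One of the telephone-number patterns listed above
PHONE = re.compile(r"\d{2} \d{2} \d{2} \d{2}")

def drop_phone_cardinals(instance):
    # Keep an entity unless it is a CARDINAL whose surface form is a phone number
    instance["ents"] = [
        ent for ent in instance["ents"]
        if not (ent["label"] == "CARDINAL"
                and PHONE.fullmatch(instance["text"][ent["start"]:ent["end"]]))
    ]
    return instance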

Licensing Information

Creative Commons Attribution-Share Alike 4.0 International license

Citation Information

The paper is in progress.