Dataset:

xed_en_fi

Tasks:

Text classification

Subtasks:

intent-classification multi-class-classification multi-label-classification

Languages:

Multilinguality:

multilingual

Size:

10K<n<100K 1K<n<10K

Language creators:

found

Annotation creators:

expert-generated

Source datasets:

extended|other-OpenSubtitles2016

Preprint:

arxiv:2011.01612

License:

cc-by-4.0

Dataset Card for xed_english_finnish

Dataset Summary

This is the XED dataset. It consists of emotion-annotated movie subtitles from OPUS, labeled with Plutchik's 8 core emotions. The data is multilabel, so a single sentence can carry more than one emotion. The original annotations were sourced mainly for English and Finnish. For the English data, we used Stanford NER (named entity recognition) (Finkel et al., 2005) to replace names and locations with the tags [PERSON] and [LOCATION], respectively. For the Finnish data, we replaced names and locations using the Turku NER corpus (Luoma et al., 2020).
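The placeholder substitution described above can be illustrated with a small sketch. It does not run Stanford NER or any model trained on the Turku NER corpus: the entity spans are assumed to be supplied by some tagger as hypothetical `(start, end, type)` character offsets, and the example sentence is made up for illustration rather than taken from the dataset.

```python
from typing import List, Tuple

# Hypothetical span format: (start_char, end_char, entity_type).
# In practice these would come from an NER tagger; none is run here.
Span = Tuple[int, int, str]

# Only the two tags named in the dataset summary are produced.
TAGS = {"PERSON": "[PERSON]", "LOCATION": "[LOCATION]"}


def mask_entities(text: str, spans: List[Span]) -> str:
    """Replace each PERSON/LOCATION span with its placeholder tag.

    Assumes the spans do not overlap.
    """
    # Work from right to left so earlier character offsets stay valid.
    for start, end, ent_type in sorted(spans, reverse=True):
        tag = TAGS.get(ent_type)
        if tag is not None:
            text = text[:start] + tag + text[end:]
    return text


# Made-up example sentence with hand-written spans.
example = "Maria flew to Helsinki yesterday."
spans = [(0, 5, "PERSON"), (14, 22, "LOCATION")]
print(mask_entities(example, spans))
```

Spans of any other entity type are left untouched, which matches the summary's mention of only two tags.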

Supported Tasks and Leaderboards

Sentiment Classification, Multiclass Classification, Multilabel Classification, Intent Classification

Languages

English, Finnish

Dataset Structure

Data Instances

{ "sentence": "A confession that you hired [PERSON] ... and are responsible for my father's murder.",
   "labels": [1, 6]  # anger, sadness
}
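For multilabel training, the `labels` list of an instance like the one above is commonly converted into a fixed-length multi-hot vector. The following is a minimal sketch of that conversion, assuming the eight emotion ids run from 1 to 8 as the Data Fields section of this card states. The helper name `to_multi_hot` is hypothetical and not part of the dataset or any library.

```python
from typing import List

NUM_EMOTIONS = 8  # Plutchik's eight core emotions, ids 1..8


def to_multi_hot(labels: List[int], num_classes: int = NUM_EMOTIONS) -> List[int]:
    """Turn a list of 1-based emotion ids into a 0/1 vector of length num_classes."""
    vector = [0] * num_classes
    for label in labels:
        if not 1 <= label <= num_classes:
            # The neutral id 0 is deliberately rejected in this sketch.
            raise ValueError(f"label {label} is outside 1..{num_classes}")
        vector[label - 1] = 1  # shift the 1-based id to a 0-based position
    return vector


instance = {
    "sentence": "A confession that you hired [PERSON] ... and are responsible for my father's murder.",
    "labels": [1, 6],  # anger, sadness
}
print(to_multi_hot(instance["labels"]))  # [1, 0, 0, 0, 0, 1, 0, 0]
```

This sketch covers only the eight emotion labels. How to represent neutral sentences (id 0) is a modeling choice left open here.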

Data Fields

sentence: a subtitle line from the dataset
labels: the emotion labels assigned to the sentence, given as a list of integers

Each integer identifies an emotion, numbered in ascending alphabetical order: anger: 1, anticipation: 2, disgust: 3, fear: 4, joy: 5, sadness: 6, surprise: 7, trust: 8. Neutral is 0 where applicable.
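The id-to-emotion mapping above can be written as a lookup table. This is a small sketch for decoding a `labels` list into readable names; `EMOTIONS` and `decode_labels` are illustrative names, not identifiers shipped with the dataset.

```python
from typing import List

# Index in this list equals the integer label used in the dataset.
EMOTIONS = [
    "neutral",       # 0
    "anger",         # 1
    "anticipation",  # 2
    "disgust",       # 3
    "fear",          # 4
    "joy",           # 5
    "sadness",       # 6
    "surprise",      # 7
    "trust",         # 8
]


def decode_labels(labels: List[int]) -> List[str]:
    """Map integer labels to emotion names, rejecting out-of-range ids."""
    names = []
    for label in labels:
        if not 0 <= label < len(EMOTIONS):
            raise ValueError(f"unknown label id: {label}")
        names.append(EMOTIONS[label])
    return names


print(decode_labels([1, 6]))  # ['anger', 'sadness']
```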

Data Splits

For English:
Number of unique data points: 17528 ('en_annotated' config) + 9675 ('en_neutral' config)
Number of emotions: 8 (+neutral)

For Finnish:
Number of unique data points: 14449 ('fi_annotated' config) + 10794 ('fi_neutral' config)
Number of emotions: 8 (+neutral)
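The four configs and their sizes can be collected in one place, along with a hedged sketch of loading one of them. The counts and config names are copied from the figures above, and the dataset id `xed_en_fi` is taken from this card's metadata. The loader assumes the Hugging Face `datasets` package is installed; whether that id still resolves depends on the installed version and the current state of the Hub, so treat `load_config` as an illustration rather than a verified recipe.

```python
# Sizes as stated in this card (unique data points per config).
CONFIG_SIZES = {
    "en_annotated": 17528,
    "en_neutral": 9675,
    "fi_annotated": 14449,
    "fi_neutral": 10794,
}


def language_total(lang: str) -> int:
    """Sum the annotated and neutral counts for one language code ('en' or 'fi')."""
    return sum(n for name, n in CONFIG_SIZES.items() if name.startswith(lang + "_"))


def load_config(config_name: str):
    """Load one config from the Hub (requires network access; not called below)."""
    if config_name not in CONFIG_SIZES:
        raise ValueError(f"unknown config: {config_name}")
    # Imported lazily so the rest of the sketch runs without the package.
    from datasets import load_dataset

    return load_dataset("xed_en_fi", config_name)


print(language_total("en"))  # 27203
print(language_total("fi"))  # 25243
```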

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

License: Creative Commons Attribution 4.0 International License (CC-BY)

Citation Information

@inproceedings{ohman2020xed,
  title={XED: A Multilingual Dataset for Sentiment Analysis and Emotion Detection},
  author={{\"O}hman, Emily and P{\`a}mies, Marc and Kajava, Kaisla and Tiedemann, J{\"o}rg},
  booktitle={The 28th International Conference on Computational Linguistics (COLING 2020)},
  year={2020}
}

Contributions

Thanks to @lhoestq and @harshalmittal4 for adding this dataset.

Author:

Unknown

Dataset size:

23.01 KB