Dataset:

Djacon/ru_goemotions


Dataset Card for RuGoEmotions

Dataset Summary

The RuGoEmotions dataset contains 34k Reddit comments labeled for 9 emotion categories (joy, interest, surprise, sadness, anger, disgust, fear, guilt and neutral). The dataset comes with predefined train/val/test splits.

Supported Tasks and Leaderboards

This dataset is intended for multi-class, multi-label emotion classification.
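In a multi-label setting each comment may carry several emotions at once, so predictions are usually decoded per label rather than by picking a single best class. The following is a minimal sketch of that idea, not a method prescribed by this card: the label order, the `decode_multilabel` helper, the 0.5 threshold, and the example scores are all assumptions made for illustration.

```python
import math

# Category names taken from the dataset summary.
# ASSUMPTION: this ordering is illustrative; check the dataset's own label mapping.
EMOTIONS = ["joy", "interest", "surprise", "sadness", "anger",
            "disgust", "fear", "guilt", "neutral"]


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_multilabel(logits, threshold=0.5):
    """Return every emotion whose independent sigmoid score reaches the threshold.

    Each label is decided on its own, so zero, one, or many labels can be returned.
    """
    if len(logits) != len(EMOTIONS):
        raise ValueError(f"expected {len(EMOTIONS)} scores, got {len(logits)}")
    return [name for name, z in zip(EMOTIONS, logits) if sigmoid(z) >= threshold]


# Hypothetical model scores for one comment (made up for this example).
example_logits = [2.1, -3.0, 0.4, -1.2, -2.5, -4.0, -3.3, -2.8, -0.9]
print(decode_multilabel(example_logits))  # ['joy', 'surprise']
```

A softmax with argmax would force exactly one label per comment, which is why an independent sigmoid per label is the more common choice for data annotated this way.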

Languages

The data is in Russian.

Dataset Structure

Data Instances

Each instance is a Reddit comment with one or more emotion annotations (or neutral).

Data Fields

The configuration includes:

  • text: the Reddit comment
  • labels: the emotion annotations
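For training a multi-label classifier, the `labels` field is typically turned into a fixed-length 0/1 vector. The sketch below shows one way to do that. It assumes that `labels` holds a list of integer class ids and that the ids follow the order in which the summary lists the categories; neither is confirmed by this card, and the `to_multi_hot` helper and the sample record are hypothetical.

```python
# Category names taken from the dataset summary.
# ASSUMPTION: label ids index into this list in this order; verify against the dataset.
EMOTIONS = ["joy", "interest", "surprise", "sadness", "anger",
            "disgust", "fear", "guilt", "neutral"]


def to_multi_hot(labels, num_classes=len(EMOTIONS)):
    """Convert a list of integer label ids into a multi-hot vector of 0s and 1s."""
    vector = [0] * num_classes
    for idx in labels:
        if not 0 <= idx < num_classes:
            raise ValueError(f"label id {idx} is outside 0..{num_classes - 1}")
        vector[idx] = 1
    return vector


# Hypothetical record in the {text, labels} shape described above (not real data).
record = {"text": "Ого, вот это новости!", "labels": [0, 2]}

print(to_multi_hot(record["labels"]))  # [1, 0, 1, 0, 0, 0, 0, 0, 0]
print([EMOTIONS[i] for i in record["labels"]])  # ['joy', 'surprise']
```

If the field turns out to store label names instead of ids, the same function applies after mapping each name through `EMOTIONS.index`.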

Data Splits

The simplified data includes a set of train/val/test splits with 26.9k, 3.29k, and 3.37k examples respectively.
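As a quick sanity check on the figures above, the split proportions can be worked out directly. This sketch uses only the rounded sizes quoted in this card, so the totals and percentages are approximate; the variable names are illustrative.

```python
# Rounded split sizes as quoted in this card (26.9k / 3.29k / 3.37k).
split_sizes = {"train": 26_900, "val": 3_290, "test": 3_370}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

for name, size in split_sizes.items():
    print(f"{name}: {size} examples ({fractions[name]:.1%})")
print(f"total: {total} examples")
```

The rounded sizes add up to about 33.6k, consistent with the 34k figure in the summary, and correspond to roughly an 80/10/10 split.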

Dataset Creation

Curation Rationale

From the paper abstract:

Understanding emotion expressed in language has a wide range of applications, from building empathetic chatbots to detecting harmful online behavior. Advancement in this area can be improved using large-scale datasets with a fine-grained typology, adaptable to multiple downstream tasks.

Source Data

Initial Data Collection and Normalization

Data was collected from Reddit comments via a variety of automated methods discussed in Section 3.1 of the paper.

Who are the source language producers?

English-speaking Reddit users.

Annotations

Who are the annotators?

Annotations were produced by 3 English-speaking crowdworkers in India.

Personal and Sensitive Information

This dataset includes the original usernames of the Reddit users who posted each comment. Although Reddit usernames are typically dissociated from personal real-world identities, this is not always the case. It may therefore be possible in some cases to discover the identities of the individuals who created this content.

Considerations for Using the Data

Social Impact of Dataset

Emotion detection is a worthwhile problem which can potentially lead to improvements such as better human/computer interaction. However, emotion detection algorithms (particularly in computer vision) have been abused in some cases to make erroneous inferences in human monitoring and assessment applications such as hiring decisions, insurance pricing, and student attentiveness (see this article).

Discussion of Biases

From the authors' GitHub page:

Potential biases in the data include: Inherent biases in Reddit and user base biases, the offensive/vulgar word lists used for data filtering, inherent or unconscious bias in assessment of offensive identity labels, annotators were all native English speakers from India. All these likely affect labelling, precision, and recall for a trained model. Anyone using this dataset should be aware of these limitations of the dataset.

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

Researchers at Amazon Alexa, Google Research, and Stanford. See the author list.

Licensing Information

The GitHub repository that houses this dataset is licensed under the Apache License 2.0.

Citation Information

@inproceedings{demszky2020goemotions,
  author    = {Demszky, Dorottya and Movshovitz-Attias, Dana and Ko, Jeongwoo and Cowen, Alan and Nemade, Gaurav and Ravi, Sujith},
  booktitle = {58th Annual Meeting of the Association for Computational Linguistics (ACL)},
  title     = {{GoEmotions: A Dataset of Fine-Grained Emotions}},
  year      = {2020}
}

Contributions

Thanks to @joeddav for adding this dataset.