Dataset:

strombergnlp/offenseval_2020

Dataset Card for "offenseval_2020"

Dataset Summary

OffensEval 2020 features a multilingual dataset with five languages. The languages included in OffensEval 2020 are:

  • Arabic
  • Danish
  • English
  • Greek
  • Turkish

The annotation follows the hierarchical tagset proposed in the Offensive Language Identification Dataset (OLID) and used in OffensEval 2019. This taxonomy breaks offensive content down by its type and target, which gives the following three sub-tasks:

  • Sub-task A - Offensive language identification;
  • Sub-task B - Automatic categorization of offense types;
  • Sub-task C - Offense target identification.

The English training data is omitted and must be obtained separately (see https://zenodo.org/record/3950379#.XxZ-aFVKipp ).

The source datasets come from:

Supported Tasks and Leaderboards

Languages

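Five languages are covered, with BCP 47 tags ar, da, en, el, and tr. The Greek config is named gr, although the BCP 47 tag for Greek is el.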

Dataset Structure

There are five named configs, one per language:

  • ar Arabic
  • da Danish
  • en English
  • gr Greek
  • tr Turkish

The English training data is absent: it consists of 9M tweets that must be rehydrated separately. See https://zenodo.org/record/3950379#.XxZ-aFVKipp
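As a minimal sketch, one config could be requested through the Hugging Face `datasets` library roughly as follows. The repository ID is the one shown at the top of this card; the helper name `load_offenseval` is hypothetical, and it has not been verified here that the dataset loads unchanged under every version of the library.

```python
# Config names as listed in this card.
CONFIGS = ("ar", "da", "en", "gr", "tr")


def load_offenseval(config, split="train"):
    """Load one language config of OffensEval 2020 (hypothetical helper)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    if config == "en" and split == "train":
        # Per this card, the English training data is not distributed here.
        raise ValueError("English training data is not included; see the Zenodo record")
    # Imported lazily so the checks above work without the library installed.
    from datasets import load_dataset

    # Assumption: the dataset is published under this repository ID.
    return load_dataset("strombergnlp/offenseval_2020", config, split=split)
```

For example, `load_offenseval("da", split="test")` would request the Danish test split, while an unknown config name fails early with a `ValueError`.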

Data Instances

An example of 'train' looks as follows.

{
  'id': '0', 
  'text': 'PLACEHOLDER TEXT', 
  'subtask_a': 1, 
}

Data Fields

  • id : a string feature.
  • text : a string feature.
  • subtask_a : whether or not the instance is offensive; 0: NOT, 1: OFF
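The integer label can be turned back into its tag name with a small lookup. The sketch below assumes instances are plain dictionaries shaped like the example in this card; the function name and the `subtask_a_label` key are hypothetical, introduced only for illustration.

```python
# Label names for subtask_a, as given in the field description.
SUBTASK_A_LABELS = {0: "NOT", 1: "OFF"}


def decode_subtask_a(instance):
    """Return a copy of the instance with a readable label added."""
    decoded = dict(instance)  # shallow copy, so the input is left untouched
    decoded["subtask_a_label"] = SUBTASK_A_LABELS[instance["subtask_a"]]
    return decoded


example = {"id": "0", "text": "PLACEHOLDER TEXT", "subtask_a": 1}
print(decode_subtask_a(example)["subtask_a_label"])  # prints OFF
```

A label outside 0 and 1 raises a `KeyError`, which makes unexpected values visible instead of silently mapping them.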

Data Splits

name  train  test
ar     7839  1827
da     2961   329
en        0  3887
gr     8743  1544
tr    31277  3515
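The split sizes above can be summarized with a few lines of arithmetic. This is only an illustrative sketch: the numbers are copied from the table, and the names `SPLIT_SIZES` and `held_out_fraction` are hypothetical.

```python
# Split sizes copied from the table in this card.
SPLIT_SIZES = {
    "ar": {"train": 7839, "test": 1827},
    "da": {"train": 2961, "test": 329},
    "en": {"train": 0, "test": 3887},
    "gr": {"train": 8743, "test": 1544},
    "tr": {"train": 31277, "test": 3515},
}


def held_out_fraction(config):
    """Fraction of a config's instances that fall in the test split."""
    sizes = SPLIT_SIZES[config]
    return sizes["test"] / (sizes["train"] + sizes["test"])


total_train = sum(s["train"] for s in SPLIT_SIZES.values())
total_test = sum(s["test"] for s in SPLIT_SIZES.values())
print(total_train, total_test)  # prints 50820 11102
```

Because the English config ships no training data, `held_out_fraction("en")` is 1.0, so per-language comparisons should treat English separately.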

Dataset Creation

Curation Rationale

Collecting data for abusive language classification. The rationale differs for each dataset.

Source Data

Initial Data Collection and Normalization

Varies per language dataset

Who are the source language producers?

Social media users

Annotations

Annotation process

Varies per language dataset

Who are the annotators?

Varies per language dataset; native speakers

Personal and Sensitive Information

The data was public at the time of collection. No PII removal has been performed.

Considerations for Using the Data

Social Impact of Dataset

The data contains abusive language. It could be misused to develop and propagate offensive language against any of the targeted groups, for example through ableism, racism, sexism, or ageism.

Discussion of Biases

Other Known Limitations

Additional Information

Dataset Curators

The datasets are curated by the authors of each sub-part's paper.

Licensing Information

This data is available and distributed under the Creative Commons Attribution license, CC-BY 4.0.

Citation Information

@inproceedings{zampieri-etal-2020-semeval,
    title = "{S}em{E}val-2020 Task 12: Multilingual Offensive Language Identification in Social Media ({O}ffens{E}val 2020)",
    author = {Zampieri, Marcos  and
      Nakov, Preslav  and
      Rosenthal, Sara  and
      Atanasova, Pepa  and
      Karadzhov, Georgi  and
      Mubarak, Hamdy  and
      Derczynski, Leon  and
      Pitenis, Zeses  and
      {\c{C}}{\"o}ltekin, {\c{C}}a{\u{g}}r{\i}},
    booktitle = "Proceedings of the Fourteenth Workshop on Semantic Evaluation",
    month = dec,
    year = "2020",
    address = "Barcelona (online)",
    publisher = "International Committee for Computational Linguistics",
    url = "https://aclanthology.org/2020.semeval-1.188",
    doi = "10.18653/v1/2020.semeval-1.188",
    pages = "1425--1447",
    abstract = "We present the results and the main findings of SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval-2020). The task included three subtasks corresponding to the hierarchical taxonomy of the OLID schema from OffensEval-2019, and it was offered in five languages: Arabic, Danish, English, Greek, and Turkish. OffensEval-2020 was one of the most popular tasks at SemEval-2020, attracting a large number of participants across all subtasks and languages: a total of 528 teams signed up to participate in the task, 145 teams submitted official runs on the test data, and 70 teams submitted system description papers.",
}

Contributions

Dataset added by its author, @leondz.