Dataset: LennardZuendorf/interpretor
Language: en
Size: 10K<n<100K
Preprint: arxiv:2012.15761
License: mit

Dataset Card for interpretor

This is an edited version of the original work from this paper, which I have uploaded to Hugging Face here. It is not my original work; I only edited it. The data is used in the similarly named Interpretor model, which is part of my bachelor's thesis.

Original Dataset Description

  • Source Homepage: GitHub
  • Source Contact: bertievidgen@gmail.com
  • Original Source: Dynamically-Generated-Hate-Speech-Dataset
  • Original Author List: Bertie Vidgen (The Alan Turing Institute), Tristan Thrush (Facebook AI Research), Zeerak Waseem (University of Sheffield) and Douwe Kiela (Facebook AI Research).

Refer to the Hugging Face or GitHub repository for more information.

Dataset Summary

This dataset contains dynamically generated hate speech, already split into training (90%) and testing (10%) sets. I intended it to be used for classification tasks such as this model.

Languages

The only language represented is English.

Dataset Structure

Data Instances

Each entry (in both train and test) looks like this:

{
  'id': ...,
  'text': ...,
}


Data Fields

  • id : an identifier for the example
  • text : the text content, used as input for classification tasks

Data Splits

The dataset is split into a training set (90% of the examples) and a test set (10%). The split was created with the Hugging Face datasets library.
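The Hugging Face `datasets` library can produce such a split in one call with `Dataset.train_test_split(test_size=0.1)`. As an illustration, here is a plain-Python sketch of a reproducible 90/10 split; the field names, toy data, and seed are illustrative, not taken from the actual dataset:

```python
import random

def split_dataset(examples, test_fraction=0.1, seed=42):
    """Shuffle a list of examples and split it into train/test portions."""
    rng = random.Random(seed)          # fixed seed makes the split reproducible
    shuffled = examples[:]             # copy so the input list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

# Toy stand-in for real entries (field names follow the card's example).
examples = [{"id": i, "text": f"example {i}"} for i in range(100)]
train, test = split_dataset(examples)
print(len(train), len(test))  # 90 10
```

With 100 examples and `test_fraction=0.1`, the sketch yields 90 training and 10 test entries, with no example appearing in both splits.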

Additional Information

Licensing Information

  • The original repository does not provide a license, but the data is free to use with proper citation of the original paper in the Proceedings of ACL 2021, available on arXiv.
  • This dataset can be used under the MIT license, with proper citation of both the original and this source.
  • That said, I suggest taking the data from the original source and doing your own editing.

Citation Information

Please cite this repository and the original authors when using this dataset.

Contributions

I removed some data fields and created a new split using the Hugging Face datasets library.