Dataset:

cardiffnlp/tweet_sentiment_multilingual

Dataset Card for cardiffnlp/tweet_sentiment_multilingual

Dataset Summary

Tweet Sentiment Multilingual is a Twitter sentiment analysis dataset covering 8 languages:

  • Arabic
  • English
  • French
  • German
  • Hindi
  • Italian
  • Portuguese
  • Spanish

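Supported Tasks and Leaderboards

  • Text classification: the dataset can be used to train a sequence classification model (for example AutoModelForSequenceClassification) from HuggingFace transformers.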
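
As a minimal sketch of the setup described above, the snippet below builds a three-way sequence classifier with transformers. The checkpoint name, the helper names build_classifier and tokenize_batch, and the max_length of 128 are assumptions made for illustration; none of them is prescribed by this card.

```python
# Sketch only: helper names and the default checkpoint are assumptions, not part of the card.
NUM_LABELS = 3  # the card defines three labels: negative, neutral, positive
ASSUMED_CHECKPOINT = "cardiffnlp/twitter-xlm-roberta-base"  # assumed base model


def build_classifier(checkpoint: str = ASSUMED_CHECKPOINT):
    """Return (tokenizer, model) configured for three-way sequence classification."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=NUM_LABELS
    )
    return tokenizer, model


def tokenize_batch(tokenizer, batch, max_length: int = 128):
    """Tokenize the 'text' column of a batch; max_length is an arbitrary choice."""
    return tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=max_length,
    )
```

A function like tokenize_batch is typically applied to a loaded split with the datasets map method in batched mode before the result is passed to a training loop.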

Dataset Structure

Data Instances

An instance from the sentiment config:

{'label': 2, 'text': '"QT @user In the original draft of the 7th book, Remus Lupin survived the Battle of Hogwarts. #HappyBirthdayRemusLupin"'}

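Data Fields

For the sentiment config:

  • text: a string feature containing the tweet.

  • label: an integer classification label with the following mapping:

    0: negative

    1: neutral

    2: positive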
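
The label mapping above can be written as a pair of lookup tables, shown here as a small sketch. The names ID2LABEL, LABEL2ID, and describe are illustrative, not part of the dataset; the example record is the instance shown in this card.

```python
# Integer-to-name mapping taken from the field description in this card.
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}
# Reverse mapping, useful when configuring a classifier head.
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def describe(example: dict) -> str:
    """Render a record as '<label name>: <tweet text>'."""
    return f"{ID2LABEL[example['label']]}: {example['text']}"


# The sample instance from this card.
example = {
    "label": 2,
    "text": '"QT @user In the original draft of the 7th book, Remus Lupin '
            'survived the Battle of Hogwarts. #HappyBirthdayRemusLupin"',
}

print(describe(example))
```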

Data Splits

name        train  validation  test
arabic      1838   323         869
english     1838   323         869
french      1838   323         869
german      1838   323         869
hindi       1838   323         869
italian     1838   323         869
portuguese  1838   323         869
spanish     1838   323         869
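
The split sizes in the table can be checked with a little arithmetic, sketched below. The load_language helper is hypothetical and assumes each language in the name column is available as a configuration of the same name through the datasets library; that is an assumption, and it is not called here.

```python
# Split sizes copied from the table; they are identical for every language.
SPLIT_SIZES = {"train": 1838, "validation": 323, "test": 869}
LANGUAGES = [
    "arabic", "english", "french", "german",
    "hindi", "italian", "portuguese", "spanish",
]

# Tweets per language and across all eight languages.
per_language_total = sum(SPLIT_SIZES.values())
corpus_total = per_language_total * len(LANGUAGES)

# Share of each split within one language.
fractions = {name: size / per_language_total for name, size in SPLIT_SIZES.items()}


def load_language(name: str):
    """Hypothetical loader: assumes a config named after each language exists."""
    # Imported lazily so the arithmetic above runs without the datasets library.
    from datasets import load_dataset

    return load_dataset("cardiffnlp/tweet_sentiment_multilingual", name)


print(per_language_total, corpus_total)
print({name: round(share, 3) for name, share in fractions.items()})
```

Each language therefore contributes 3030 tweets, for 24240 in total, with roughly 61% in train, 11% in validation, and 29% in test.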

Dataset Curators

Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa-Anke, and Leonardo Neves, through Cardiff NLP.

Licensing Information

Creative Commons Attribution 3.0 Unported License. All of the datasets require complying with the Twitter Terms Of Service and the Twitter API Terms Of Service.

Citation Information

@inproceedings{barbieri-etal-2022-xlm,
    title = "{XLM}-{T}: Multilingual Language Models in {T}witter for Sentiment Analysis and Beyond",
    author = "Barbieri, Francesco  and
      Espinosa Anke, Luis  and
      Camacho-Collados, Jose",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.27",
    pages = "258--266",
    abstract = "Language models are ubiquitous in current NLP, and their multilingual capacity has recently attracted considerable attention. However, current analyses have almost exclusively focused on (multilingual variants of) standard benchmarks, and have relied on clean pre-training and task-specific corpora as multilingual signals. In this paper, we introduce XLM-T, a model to train and evaluate multilingual language models in Twitter. In this paper we provide: (1) a new strong multilingual baseline consisting of an XLM-R (Conneau et al. 2020) model pre-trained on millions of tweets in over thirty languages, alongside starter code to subsequently fine-tune on a target task; and (2) a set of unified sentiment analysis Twitter datasets in eight different languages and a XLM-T model trained on this dataset.",
}