Dataset:

cardiffnlp/tweet_topic_multi

English

Dataset Card for "cardiffnlp/tweet_topic_multi"

Dataset Summary

This is the official repository of TweetTopic ("Twitter Topic Classification", COLING 2022 main conference), a topic classification dataset on Twitter with 19 labels. Each TweetTopic instance comes with a timestamp ranging from September 2019 to August 2021. See cardiffnlp/tweet_topic_single for the single-label version of TweetTopic. The tweet collection used in TweetTopic is the same as the one used in TweetNER7. The dataset is also integrated into TweetNLP.
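
To get a feel for the data, here is a minimal loading sketch using the Hugging Face datasets library (assuming datasets is installed); the field names follow the data instance shown later in this card.

from datasets import load_dataset

# Load every available split as a DatasetDict and inspect one example.
dataset = load_dataset("cardiffnlp/tweet_topic_multi")
print(dataset)                                 # split names and sizes
print(dataset["train_all"][0]["text"])         # pre-processed tweet text
print(dataset["train_all"][0]["label_name"])   # topic names assigned to the tweet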

Preprocessing

Before annotation, we pre-process tweets to normalize some text features: URLs are converted into the special token {{URL}} and non-verified usernames are replaced with {{USERNAME}}. For verified usernames, we wrap the display name (or account name) in the symbols {@ and @}. For example, the tweet

Get the all-analog Classic Vinyl Edition
of "Takin' Off" Album from @herbiehancock
via @bluenoterecords link below: 
http://bluenote.lnk.to/AlbumOfTheWeek

is converted into the following text.

Get the all-analog Classic Vinyl Edition
of "Takin' Off" Album from {@herbiehancock@}
via {@bluenoterecords@} link below: {{URL}}

A simple function to format tweets in this way is shown below.

import re
from urlextract import URLExtract
extractor = URLExtract()

def format_tweet(tweet):
    # mask web urls
    urls = extractor.find_urls(tweet)
    for url in urls:
        tweet = tweet.replace(url, "{{URL}}")
    # format twitter account
    tweet = re.sub(r"\b(\s*)(@[\S]+)\b", r'\1{\2@}', tweet)
    return tweet

target = """Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from @herbiehancock via @bluenoterecords link below: http://bluenote.lnk.to/AlbumOfTheWeek"""
target_format = format_tweet(target)
print(target_format)
Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from {@herbiehancock@} via {@bluenoterecords@} link below: {{URL}}

Data Splits

| split | number of texts | description |
|:------------------------|-----:|:---------------------------------------------------------------------------------------------|
| test_2020 | 573 | test dataset from September 2019 to August 2020 |
| test_2021 | 1679 | test dataset from September 2020 to August 2021 |
| train_2020 | 4585 | training dataset from September 2019 to August 2020 |
| train_2021 | 1505 | training dataset from September 2020 to August 2021 |
| train_all | 6090 | combined training dataset of train_2020 and train_2021 |
| validation_2020 | 573 | validation dataset from September 2019 to August 2020 |
| validation_2021 | 188 | validation dataset from September 2020 to August 2021 |
| train_random | 4564 | randomly sampled training dataset with the same size as train_2020 from train_all |
| validation_random | 573 | randomly sampled validation dataset with the same size as validation_2020 from validation_all |
| test_coling2022_random | 5536 | random split used in the COLING 2022 paper |
| train_coling2022_random | 5731 | random split used in the COLING 2022 paper |
| test_coling2022 | 5536 | temporal split used in the COLING 2022 paper |
| train_coling2022 | 5731 | temporal split used in the COLING 2022 paper |

For the temporal-shift setting, the model should be trained on train_2020, validated on validation_2020, and evaluated on test_2021. In general, the model would be trained on train_all, the most representative training set, validated on validation_2021, and evaluated on test_2021.
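
As a sketch of the two settings above (again assuming the datasets library), the relevant splits can be requested by name:

from datasets import load_dataset

# Temporal-shift setting: train on 2020 data, evaluate on 2021 data.
train = load_dataset("cardiffnlp/tweet_topic_multi", split="train_2020")
validation = load_dataset("cardiffnlp/tweet_topic_multi", split="validation_2020")
test = load_dataset("cardiffnlp/tweet_topic_multi", split="test_2021")

# General setting: train on the combined training set.
train_all = load_dataset("cardiffnlp/tweet_topic_multi", split="train_all")
validation_2021 = load_dataset("cardiffnlp/tweet_topic_multi", split="validation_2021")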

IMPORTANT NOTE: To obtain results comparable with those of the COLING 2022 Tweet Topic paper, use train_coling2022 and test_coling2022 for the temporal-shift setting, and train_coling2022_random and test_coling2022_random for the random split (the coling2022 splits have no validation set).

Models

| model | training data | F1 | F1 (macro) | Accuracy |
|:---------|:------------------|---------:|-----------:|---------:|
| 12310321 | all (2020 + 2021) | 0.763104 | 0.620257 | 0.536629 |
| 12311321 | all (2020 + 2021) | 0.751814 | 0.600782 | 0.531864 |
| 12312321 | all (2020 + 2021) | 0.762513 | 0.603533 | 0.547945 |
| 12313321 | all (2020 + 2021) | 0.759917 | 0.59901 | 0.536033 |
| 12314321 | all (2020 + 2021) | 0.764767 | 0.618702 | 0.548541 |
| 12315321 | 2020 only | 0.732366 | 0.579456 | 0.493746 |
| 12316321 | 2020 only | 0.725229 | 0.561261 | 0.499107 |
| 12317321 | 2020 only | 0.73671 | 0.565624 | 0.513401 |
| 12318321 | 2020 only | 0.729446 | 0.534799 | 0.50268 |
| 12319321 | 2020 only | 0.731106 | 0.532141 | 0.509827 |

The model fine-tuning script can be found here.
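
For reference, below is a minimal multi-label inference sketch with transformers and torch. The checkpoint name is a placeholder for one of the fine-tuned models in the table above, and the 0.5 sigmoid threshold is an assumption rather than a value prescribed by this card.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder: substitute one of the fine-tuned checkpoints listed above.
MODEL = "cardiffnlp/<tweet-topic-multi-checkpoint>"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

text = 'Get the all-analog Classic Vinyl Edition of "Takin\' Off" Album from {@herbiehancock@} via {@bluenoterecords@} link below: {{URL}}'
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Multi-label prediction: sigmoid per label, thresholded at 0.5 (assumed).
probs = torch.sigmoid(logits)[0]
predicted = [model.config.id2label[i] for i, p in enumerate(probs) if p > 0.5]
print(predicted)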

Dataset Structure

Data Instances

An example of the train split looks as follows.

{
    "date": "2021-03-07",
    "text": "The latest The Movie theater Daily! {{URL}} Thanks to {{USERNAME}} {{USERNAME}} {{USERNAME}} #lunchtimeread #amc1000",
    "id": "1368464923370676231",
    "label": [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "label_name": ["film_tv_&_video"]
}

Label ID

The label2id dictionary can be found here.

{
    "arts_&_culture": 0,
    "business_&_entrepreneurs": 1,
    "celebrity_&_pop_culture": 2,
    "diaries_&_daily_life": 3,
    "family": 4,
    "fashion_&_style": 5,
    "film_tv_&_video": 6,
    "fitness_&_health": 7,
    "food_&_dining": 8,
    "gaming": 9,
    "learning_&_educational": 10,
    "music": 11,
    "news_&_social_concern": 12,
    "other_hobbies": 13,
    "relationships": 14,
    "science_&_technology": 15,
    "sports": 16,
    "travel_&_adventure": 17,
    "youth_&_student_life": 18
}
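
As a small illustration, the multi-hot label vector of a data instance can be decoded back into topic names with this dictionary (id2label below is simply the inverse mapping, built here for convenience):

label2id = {
    "arts_&_culture": 0, "business_&_entrepreneurs": 1, "celebrity_&_pop_culture": 2,
    "diaries_&_daily_life": 3, "family": 4, "fashion_&_style": 5, "film_tv_&_video": 6,
    "fitness_&_health": 7, "food_&_dining": 8, "gaming": 9, "learning_&_educational": 10,
    "music": 11, "news_&_social_concern": 12, "other_hobbies": 13, "relationships": 14,
    "science_&_technology": 15, "sports": 16, "travel_&_adventure": 17,
    "youth_&_student_life": 18,
}
id2label = {i: name for name, i in label2id.items()}

# Multi-hot vector from the data instance above: only index 6 is set.
label = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print([id2label[i] for i, v in enumerate(label) if v == 1])  # ['film_tv_&_video']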

Citation Information

@inproceedings{dimosthenis-etal-2022-twitter,
    title = "{T}witter {T}opic {C}lassification",
    author = "Antypas, Dimosthenis  and
    Ushio, Asahi  and
    Camacho-Collados, Jose  and
    Neves, Leonardo  and
    Silva, Vitor  and
    Barbieri, Francesco",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics"
}