Dataset:

cardiffnlp/tweet_topic_multi

English

Dataset Card for "cardiffnlp/tweet_topic_multi"

Dataset Summary

This is the official repository of TweetTopic ("Twitter Topic Classification", COLING 2022 main conference), a topic classification dataset on Twitter with 19 labels. Each TweetTopic instance comes with a timestamp ranging from September 2019 to August 2021. See cardiffnlp/tweet_topic_single for the single-label version of TweetTopic. The tweet collection used in TweetTopic is the same as the one used in TweetNER7. The dataset is also integrated into TweetNLP.
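
To get a feel for the data, here is a minimal loading sketch using the Hugging Face datasets library (assuming datasets is installed); the field names follow the data instance shown later in this card.

from datasets import load_dataset

# Load every available split as a DatasetDict and inspect one example.
dataset = load_dataset("cardiffnlp/tweet_topic_multi")
print(dataset)                                 # split names and sizes
print(dataset["train_all"][0]["text"])         # pre-processed tweet text
print(dataset["train_all"][0]["label_name"])   # topic names assigned to the tweet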

Preprocessing

Before annotation, we pre-process tweets to normalize some text features: URLs are converted into the special token {{URL}} and non-verified usernames are replaced with {{USERNAME}}. For verified usernames, we wrap the display name (or account name) in the symbols {@ and @}. For example, the tweet

Get the all-analog Classic Vinyl Edition
of "Takin' Off" Album from @herbiehancock
via @bluenoterecords link below: 
http://bluenote.lnk.to/AlbumOfTheWeek

is converted into the following text.

Get the all-analog Classic Vinyl Edition
of "Takin' Off" Album from {@herbiehancock@}
via {@bluenoterecords@} link below: {{URL}}

A simple function to format tweets in this way is shown below.

import re
from urlextract import URLExtract
extractor = URLExtract()

def format_tweet(tweet):
    # mask web urls
    urls = extractor.find_urls(tweet)
    for url in urls:
        tweet = tweet.replace(url, "{{URL}}")
    # format twitter account
    tweet = re.sub(r"\b(\s*)(@[\S]+)\b", r'\1{\2@}', tweet)
    return tweet

target = """Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from @herbiehancock via @bluenoterecords link below: http://bluenote.lnk.to/AlbumOfTheWeek"""
target_format = format_tweet(target)
print(target_format)
Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from {@herbiehancock@} via {@bluenoterecords@} link below: {{URL}}

Data Splits

| split | number of texts | description |
|:------------------------|-----:|:---------------------------------------------------------------------------------------------|
| test_2020 | 573 | test dataset from September 2019 to August 2020 |
| test_2021 | 1679 | test dataset from September 2020 to August 2021 |
| train_2020 | 4585 | training dataset from September 2019 to August 2020 |
| train_2021 | 1505 | training dataset from September 2020 to August 2021 |
| train_all | 6090 | combined training dataset of train_2020 and train_2021 |
| validation_2020 | 573 | validation dataset from September 2019 to August 2020 |
| validation_2021 | 188 | validation dataset from September 2020 to August 2021 |
| train_random | 4564 | randomly sampled training dataset with the same size as train_2020 from train_all |
| validation_random | 573 | randomly sampled validation dataset with the same size as validation_2020 from validation_all |
| test_coling2022_random | 5536 | random split used in the COLING 2022 paper |
| train_coling2022_random | 5731 | random split used in the COLING 2022 paper |
| test_coling2022 | 5536 | temporal split used in the COLING 2022 paper |
| train_coling2022 | 5731 | temporal split used in the COLING 2022 paper |

For the temporal-shift setting, the model should be trained on train_2020, validated on validation_2020, and evaluated on test_2021. In general, the model would be trained on train_all, the most representative training set, validated on validation_2021, and evaluated on test_2021.
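
As a sketch of the two settings above (again assuming the datasets library), the relevant splits can be requested by name:

from datasets import load_dataset

# Temporal-shift setting: train on 2020 data, evaluate on 2021 data.
train = load_dataset("cardiffnlp/tweet_topic_multi", split="train_2020")
validation = load_dataset("cardiffnlp/tweet_topic_multi", split="validation_2020")
test = load_dataset("cardiffnlp/tweet_topic_multi", split="test_2021")

# General setting: train on the combined training set.
train_all = load_dataset("cardiffnlp/tweet_topic_multi", split="train_all")
validation_2021 = load_dataset("cardiffnlp/tweet_topic_multi", split="validation_2021")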

IMPORTANT NOTE: To obtain results comparable with those of the COLING 2022 Tweet Topic paper, use train_coling2022 and test_coling2022 for the temporal-shift setting, and train_coling2022_random and test_coling2022_random for the random split (the coling2022 splits have no validation set).

Models

| model | training data | F1 | F1 (macro) | Accuracy |
|:---------|:------------------|---------:|-----------:|---------:|
| 12310321 | all (2020 + 2021) | 0.763104 | 0.620257 | 0.536629 |
| 12311321 | all (2020 + 2021) | 0.751814 | 0.600782 | 0.531864 |
| 12312321 | all (2020 + 2021) | 0.762513 | 0.603533 | 0.547945 |
| 12313321 | all (2020 + 2021) | 0.759917 | 0.59901 | 0.536033 |
| 12314321 | all (2020 + 2021) | 0.764767 | 0.618702 | 0.548541 |
| 12315321 | 2020 only | 0.732366 | 0.579456 | 0.493746 |
| 12316321 | 2020 only | 0.725229 | 0.561261 | 0.499107 |
| 12317321 | 2020 only | 0.73671 | 0.565624 | 0.513401 |
| 12318321 | 2020 only | 0.729446 | 0.534799 | 0.50268 |
| 12319321 | 2020 only | 0.731106 | 0.532141 | 0.509827 |

The model fine-tuning script can be found here.
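
For reference, below is a minimal multi-label inference sketch with transformers and torch. The checkpoint name is a placeholder for one of the fine-tuned models in the table above, and the 0.5 sigmoid threshold is an assumption rather than a value prescribed by this card.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder: substitute one of the fine-tuned checkpoints listed above.
MODEL = "cardiffnlp/<tweet-topic-multi-checkpoint>"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

text = 'Get the all-analog Classic Vinyl Edition of "Takin\' Off" Album from {@herbiehancock@} via {@bluenoterecords@} link below: {{URL}}'
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Multi-label prediction: sigmoid per label, thresholded at 0.5 (assumed).
probs = torch.sigmoid(logits)[0]
predicted = [model.config.id2label[i] for i, p in enumerate(probs) if p > 0.5]
print(predicted)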

Dataset Structure

Data Instances

An example of the train split looks as follows.

{
    "date": "2021-03-07",
    "text": "The latest The Movie theater Daily! {{URL}} Thanks to {{USERNAME}} {{USERNAME}} {{USERNAME}} #lunchtimeread #amc1000",
    "id": "1368464923370676231",
    "label": [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "label_name": ["film_tv_&_video"]
}

Label ID

The label2id dictionary can be found here.

{
    "arts_&_culture": 0,
    "business_&_entrepreneurs": 1,
    "celebrity_&_pop_culture": 2,
    "diaries_&_daily_life": 3,
    "family": 4,
    "fashion_&_style": 5,
    "film_tv_&_video": 6,
    "fitness_&_health": 7,
    "food_&_dining": 8,
    "gaming": 9,
    "learning_&_educational": 10,
    "music": 11,
    "news_&_social_concern": 12,
    "other_hobbies": 13,
    "relationships": 14,
    "science_&_technology": 15,
    "sports": 16,
    "travel_&_adventure": 17,
    "youth_&_student_life": 18
}
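
As a small illustration, the multi-hot label vector of a data instance can be decoded back into topic names with this dictionary (id2label below is simply the inverse mapping, built here for convenience):

label2id = {
    "arts_&_culture": 0, "business_&_entrepreneurs": 1, "celebrity_&_pop_culture": 2,
    "diaries_&_daily_life": 3, "family": 4, "fashion_&_style": 5, "film_tv_&_video": 6,
    "fitness_&_health": 7, "food_&_dining": 8, "gaming": 9, "learning_&_educational": 10,
    "music": 11, "news_&_social_concern": 12, "other_hobbies": 13, "relationships": 14,
    "science_&_technology": 15, "sports": 16, "travel_&_adventure": 17,
    "youth_&_student_life": 18,
}
id2label = {i: name for name, i in label2id.items()}

# Multi-hot vector from the data instance above: only index 6 is set.
label = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print([id2label[i] for i, v in enumerate(label) if v == 1])  # ['film_tv_&_video']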

Citation Information

@inproceedings{dimosthenis-etal-2022-twitter,
    title = "{T}witter {T}opic {C}lassification",
    author = "Antypas, Dimosthenis  and
    Ushio, Asahi  and
    Camacho-Collados, Jose  and
    Neves, Leonardo  and
    Silva, Vitor  and
    Barbieri, Francesco",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics"
}