Dataset:

cardiffnlp/tweet_topic_single

English

Dataset Card for "cardiffnlp/tweet_topic_single"

Dataset Summary

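This is the official repository of TweetTopic ("Twitter Topic Classification", COLING 2022 main conference), a topic classification dataset of tweets with 6 labels. Each TweetTopic instance carries a timestamp, ranging from September 2019 to August 2021. For the multi-label version of TweetTopic, see cardiffnlp/tweet_topic_multi. The tweet collection used in TweetTopic is the same as the one used in TweetNER7. The dataset is also integrated into TweetNLP.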

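Preprocessing

We pre-process tweets before annotation to normalize some artifacts: URLs are converted into the special token {{URL}} and non-verified usernames into {{USERNAME}}. For verified usernames, we wrap the display name (or account name) in the symbols {@ and @}. For example, the tweet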

Get the all-analog Classic Vinyl Edition
of "Takin' Off" Album from @herbiehancock
via @bluenoterecords link below: 
http://bluenote.lnk.to/AlbumOfTheWeek
is transformed into the following text.

Get the all-analog Classic Vinyl Edition
of "Takin' Off" Album from {@herbiehancock@}
via {@bluenoterecords@} link below: {{URL}}

A simple function to format tweets follows.

import re

from urlextract import URLExtract

extractor = URLExtract()


def format_tweet(tweet):
    # Mask web URLs with the special token {{URL}}.
    urls = extractor.find_urls(tweet)
    for url in urls:
        tweet = tweet.replace(url, "{{URL}}")
    # Wrap Twitter account mentions as {@account@}.
    tweet = re.sub(r"\b(\s*)(@[\S]+)\b", r'\1{\2@}', tweet)
    return tweet


target = """Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from @herbiehancock via @bluenoterecords link below: http://bluenote.lnk.to/AlbumOfTheWeek"""
target_format = format_tweet(target)
print(target_format)
# Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from {@herbiehancock@} via {@bluenoterecords@} link below: {{URL}}
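Going the other way can be useful when inspecting the data. The following is a minimal sketch, not part of the original card, that reads the {@account@} markers back out of a formatted tweet. The names VERIFIED_MENTION, verified_accounts and unformat_mentions are hypothetical helpers introduced here for illustration, and the pattern assumes account names contain no whitespace.

```python
import re

# Matches a verified-account mention wrapped as {@account@}.
# Assumption: account names contain no whitespace.
VERIFIED_MENTION = re.compile(r"\{@(\S+?)@\}")


def verified_accounts(text):
    """Return the account names wrapped in {@ ... @}, in order of appearance."""
    return VERIFIED_MENTION.findall(text)


def unformat_mentions(text):
    """Turn {@account@} back into @account; {{URL}} and {{USERNAME}} stay masked."""
    return VERIFIED_MENTION.sub(r"@\1", text)


formatted = "Album from {@herbiehancock@} via {@bluenoterecords@} link below: {{URL}}"
print(verified_accounts(formatted))
print(unformat_mentions(formatted))
```

The masked tokens {{URL}} and {{USERNAME}} cannot be restored, since the original values are not kept in the text.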

Data Splits

| split | number of texts | description |
| --- | --- | --- |
| test_2020 | 376 | test dataset from September 2019 to August 2020 |
| test_2021 | 1693 | test dataset from September 2020 to August 2021 |
| train_2020 | 2858 | training dataset from September 2019 to August 2020 |
| train_2021 | 1516 | training dataset from September 2020 to August 2021 |
| train_all | 4374 | combined training dataset of train_2020 and train_2021 |
| validation_2020 | 352 | validation dataset from September 2019 to August 2020 |
| validation_2021 | 189 | validation dataset from September 2020 to August 2021 |
| train_random | 2830 | randomly sampled training dataset with the same size as train_2020 from train_all |
| validation_random | 354 | randomly sampled validation dataset with the same size as validation_2020 from validation_all |
| test_coling2022_random | 3399 | random split used in the COLING 2022 paper |
| train_coling2022_random | 3598 | random split used in the COLING 2022 paper |
| test_coling2022 | 3399 | temporal split used in the COLING 2022 paper |
| train_coling2022 | 3598 | temporal split used in the COLING 2022 paper |

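For the temporal-shift setting, a model should be trained on train_2020, validated on validation_2020, and evaluated on test_2021. In the general case, a model should be trained on train_all, the most representative training set, validated on validation_2021, and evaluated on test_2021.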

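IMPORTANT NOTE: To obtain results comparable with those of the COLING 2022 Tweet Topic paper, use train_coling2022 and test_coling2022 for the temporal-shift setting, and train_coling2022_random and test_coling2022_random for the random split (the coling2022 splits have no validation set).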
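The recommendations above can be written down as a small lookup table. This is a sketch rather than an official loader: SPLIT_SETTINGS and load_setting are hypothetical names introduced here, and the only library call used is load_dataset from the Hugging Face datasets package with its split argument. Depending on the installed datasets version, additional arguments may be needed to load this dataset.

```python
# Which split plays which role in each evaluation setting described above.
# None marks a role that has no split (the coling2022 splits have no validation set).
SPLIT_SETTINGS = {
    "temporal": {
        "train": "train_2020",
        "validation": "validation_2020",
        "test": "test_2021",
    },
    "general": {
        "train": "train_all",
        "validation": "validation_2021",
        "test": "test_2021",
    },
    "coling2022_temporal": {
        "train": "train_coling2022",
        "validation": None,
        "test": "test_coling2022",
    },
    "coling2022_random": {
        "train": "train_coling2022_random",
        "validation": None,
        "test": "test_coling2022_random",
    },
}


def load_setting(setting):
    """Load the splits for one setting (hypothetical helper, needs network access)."""
    # Imported lazily so the table above can be used without the package installed.
    from datasets import load_dataset

    return {
        role: load_dataset("cardiffnlp/tweet_topic_single", split=name)
        for role, name in SPLIT_SETTINGS[setting].items()
        if name is not None
    }


for setting, roles in SPLIT_SETTINGS.items():
    print(setting, roles)
```

Calling load_setting("temporal") would then return a dictionary with the train, validation and test splits for the temporal-shift setting.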

Models

| model | training data | F1 | F1 (macro) | Accuracy |
| --- | --- | --- | --- | --- |
| 12310321 | all (2020 + 2021) | 0.896043 | 0.800061 | 0.896043 |
| 12311321 | all (2020 + 2021) | 0.887773 | 0.79793 | 0.887773 |
| 12312321 | all (2020 + 2021) | 0.892499 | 0.774494 | 0.892499 |
| 12313321 | all (2020 + 2021) | 0.890136 | 0.776025 | 0.890136 |
| 12314321 | all (2020 + 2021) | 0.894861 | 0.800952 | 0.894861 |
| 12315321 | 2020 only | 0.878913 | 0.70565 | 0.878913 |
| 12316321 | 2020 only | 0.868281 | 0.729667 | 0.868281 |
| 12317321 | 2020 only | 0.882457 | 0.740187 | 0.882457 |
| 12318321 | 2020 only | 0.87596 | 0.746275 | 0.87596 |
| 12319321 | 2020 only | 0.877732 | 0.746119 | 0.877732 |
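In every row of the table, the F1 column equals the Accuracy column. That is what micro-averaged F1 yields for single-label classification, so the F1 column is presumably micro-averaged; this is an inference from the numbers, not something the card states. The sketch below uses made-up toy labels (not dataset results) and a hypothetical helper f1_scores to show why the two coincide and why macro F1 can be noticeably lower when a rare class is missed.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1, accuracy) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            # One wrong prediction is a false positive for the predicted
            # class and a false negative for the true class.
            fp[p] += 1
            fn[t] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return micro, macro, accuracy


# Toy labels for illustration only: the single example of class 0 is missed.
y_true = [4, 4, 4, 4, 2, 0]
y_pred = [4, 4, 4, 4, 2, 2]
micro, macro, accuracy = f1_scores(y_true, y_pred)
print(f"micro F1={micro:.4f}  macro F1={macro:.4f}  accuracy={accuracy:.4f}")
```

Because each error counts once as a false positive and once as a false negative, micro F1 reduces to the fraction of correct predictions, while macro F1 weights every class equally regardless of its frequency.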

The model fine-tuning script can be found here.

Dataset Structure

Data Instances

An example from train looks as follows.

{
    "text": "Game day for {{USERNAME}} U18\u2019s against {{USERNAME}} U18\u2019s. Even though it\u2019s a \u2018home\u2019 game for the people that have settled in Mid Wales it\u2019s still a 4 hour round trip for us up to Colwyn Bay. Still enjoy it though!",
    "date": "2019-09-08",
    "label": 4,
    "id": "1170606779568463874",
    "label_name": "sports_&_gaming"
}
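The date field is what ties an instance to the 2020 or 2021 period used in the split names. As a rough sketch, the helper below (period_of, a hypothetical name) maps an instance to its period. The exact day boundaries are an assumption: the card only gives months (September 2019 to August 2020, and September 2020 to August 2021). The instance text is shortened here for brevity.

```python
from datetime import date

# Assumed boundaries: the card states months only, so the first and last
# day of each month range are taken as inclusive limits.
PERIODS = {
    "2020": (date(2019, 9, 1), date(2020, 8, 31)),
    "2021": (date(2020, 9, 1), date(2021, 8, 31)),
}


def period_of(instance):
    """Return "2020" or "2021" for an instance, or None if the date is out of range."""
    day = date.fromisoformat(instance["date"])
    for name, (start, end) in PERIODS.items():
        if start <= day <= end:
            return name
    return None


# Same fields as the example above, with the text shortened.
instance = {
    "text": "Game day for {{USERNAME}} U18's against {{USERNAME}} U18's.",
    "date": "2019-09-08",
    "label": 4,
    "id": "1170606779568463874",  # tweet IDs are stored as strings
    "label_name": "sports_&_gaming",
}
print(period_of(instance))
```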

Label ID

The label2id dictionary can be found here.

{
    "arts_&_culture": 0,
    "business_&_entrepreneurs": 1,
    "pop_culture": 2,
    "daily_life": 3,
    "sports_&_gaming": 4,
    "science_&_technology": 5
}
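To turn a predicted or stored label ID back into its name, the mapping can be inverted. A minimal sketch follows; id2label is a name introduced here, and the dictionary is copied from the mapping above.

```python
# label2id as listed above.
label2id = {
    "arts_&_culture": 0,
    "business_&_entrepreneurs": 1,
    "pop_culture": 2,
    "daily_life": 3,
    "sports_&_gaming": 4,
    "science_&_technology": 5,
}

# Invert the mapping: integer ID -> label name.
id2label = {idx: name for name, idx in label2id.items()}

# The data instance shown earlier has label 4 and label_name "sports_&_gaming".
print(id2label[4])
```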

Citation Information

@inproceedings{dimosthenis-etal-2022-twitter,
    title = "{T}witter {T}opic {C}lassification",
    author = "Antypas, Dimosthenis  and
    Ushio, Asahi  and
    Camacho-Collados, Jose  and
    Neves, Leonardo  and
    Silva, Vitor  and
    Barbieri, Francesco",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics"
}