Dataset:

tner/tweetner7

Task:

Token Classification

Sub-task:

named-entity-recognition

Language:

Multilinguality:

monolingual

Size:

size_categories:1k<10K

ArXiv:

arxiv:2210.03797

License:

other

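Dataset card files

English

Dataset Card for "tner/tweetner7"

Dataset Summary

This is the official repository of TweetNER7 ("Named Entity Recognition in Twitter: A Dataset and Analysis on Short-Term Temporal Shifts", AACL 2022 main conference), an NER dataset on Twitter with seven entity labels. Each TweetNER7 instance has a timestamp, ranging from September 2019 to August 2021. The tweet collection used in TweetNER7 is the same as the one used in TweetTopic. The dataset is also integrated into TweetNLP.

Entity types: corporation, creative_work, event, group, location, product, person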

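Preprocessing

Before annotation, we pre-process the tweets to normalize some artifacts: URLs are converted into the special token {{URL}}, and non-verified usernames are converted into {{USERNAME}}. For verified usernames, we wrap the display name (or account name) between the symbols {@ and @}. For example, the tweet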

Get the all-analog Classic Vinyl Edition
of "Takin' Off" Album from @herbiehancock
via @bluenoterecords link below: 
http://bluenote.lnk.to/AlbumOfTheWeek

is transformed into the following text.

Get the all-analog Classic Vinyl Edition
of "Takin' Off" Album from {@herbiehancock@}
via {@bluenoterecords@} link below: {{URL}}

A simple function to format a tweet is shown below.

import re
from urlextract import URLExtract
extractor = URLExtract()

def format_tweet(tweet):
    # mask web urls
    urls = extractor.find_urls(tweet)
    for url in urls:
        tweet = tweet.replace(url, "{{URL}}")
    # format twitter account
    tweet = re.sub(r"\b(\s*)(@[\S]+)\b", r'\1{\2@}', tweet)
    return tweet

target = """Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from @herbiehancock via @bluenoterecords link below: http://bluenote.lnk.to/AlbumOfTheWeek"""
target_format = format_tweet(target)
print(target_format)
# Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from {@herbiehancock@} via {@bluenoterecords@} link below: {{URL}}

We ask annotators to ignore these special tokens, but to label mentions of verified users.

Data Splits

split	number of instances	description
train_2020	4616	training dataset from September 2019 to August 2020
train_2021	2495	training dataset from September 2020 to August 2021
train_all	7111	combined training dataset of train_2020 and train_2021
validation_2020	576	validation dataset from September 2019 to August 2020
validation_2021	310	validation dataset from September 2020 to August 2021
test_2020	576	test dataset from September 2019 to August 2020
test_2021	2807	test dataset from September 2020 to August 2021
train_random	4616	randomly sampled training dataset with the same size as train_2020 from train_all
validation_random	576	randomly sampled validation dataset with the same size as validation_2020 from validation_all
extra_2020	87880	extra tweets without annotations from September 2019 to August 2020
extra_2021	93594	extra tweets without annotations from September 2020 to August 2021

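In the temporal-shift setting, a model should be trained on train_2020, validated on validation_2020, and evaluated on test_2021. In the general setting, a model is trained on train_all, the most representative training set, validated on validation_2021, and evaluated on test_2021.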
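The two setups can be written down as a small lookup. This is a minimal sketch: the split names and sizes come from the table above, while `SETUPS`, `splits_for`, and `load_setup` are hypothetical helper names introduced here. `load_setup` assumes the Hugging Face `datasets` package is installed and that each split can be requested by name; depending on the library version, extra arguments may be needed.

```python
# Split names per evaluation setup, as described in the text above.
SETUPS = {
    "temporal_shift": {"train": "train_2020", "validation": "validation_2020", "test": "test_2021"},
    "general": {"train": "train_all", "validation": "validation_2021", "test": "test_2021"},
}

# Instance counts copied from the split table.
SPLIT_SIZES = {"train_2020": 4616, "train_2021": 2495, "train_all": 7111}


def splits_for(setup):
    """Return the train/validation/test split names for a setup (hypothetical helper)."""
    if setup not in SETUPS:
        raise ValueError(f"unknown setup: {setup!r}")
    return dict(SETUPS[setup])


def load_setup(setup):
    """Load the three splits of a setup; needs network access and `datasets`."""
    from datasets import load_dataset  # imported lazily so the lookup works without it

    return {
        role: load_dataset("tner/tweetner7", split=name)
        for role, name in splits_for(setup).items()
    }


# train_all has exactly as many instances as the two yearly training splits together.
assert SPLIT_SIZES["train_2020"] + SPLIT_SIZES["train_2021"] == SPLIT_SIZES["train_all"]
print(splits_for("temporal_shift"))
```

Both setups share test_2021, so the only difference is whether the model has seen any 2021 data during training and validation.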

Dataset Structure

Data Instances

An example from train looks as follows.

{
    'tokens': ['Morning', '5km', 'run', 'with', '{{USERNAME}}', 'for', 'breast', 'cancer', 'awareness', '#', 'pinkoctober', '#', 'breastcancerawareness', '#', 'zalorafit', '#', 'zalorafitxbnwrc', '@', 'The', 'Central', 'Park', ',', 'Desa', 'Parkcity', '{{URL}}'],
    'tags': [14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 2, 14, 2, 14, 14, 14, 14, 14, 14, 4, 11, 11, 11, 11, 14],
    'id': '1183344337016381440',
    'date': '2019-10-13'
}

Label ID

The label2id dictionary can be found here.

{
    "B-corporation": 0,
    "B-creative_work": 1,
    "B-event": 2,
    "B-group": 3,
    "B-location": 4,
    "B-person": 5,
    "B-product": 6,
    "I-corporation": 7,
    "I-creative_work": 8,
    "I-event": 9,
    "I-group": 10,
    "I-location": 11,
    "I-person": 12,
    "I-product": 13,
    "O": 14
}
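To read an instance, the integer tags have to be mapped back to their BIO labels and grouped into spans. The sketch below does this for the example instance shown earlier, using the label2id dictionary above; `extract_entities` is a hypothetical helper written for illustration, not part of the dataset or of the tner library, and it simply drops any I- tag that does not continue a span of the same type.

```python
label2id = {
    "B-corporation": 0, "B-creative_work": 1, "B-event": 2, "B-group": 3,
    "B-location": 4, "B-person": 5, "B-product": 6,
    "I-corporation": 7, "I-creative_work": 8, "I-event": 9, "I-group": 10,
    "I-location": 11, "I-person": 12, "I-product": 13, "O": 14,
}
id2label = {idx: label for label, idx in label2id.items()}

# The example instance from the "Data Instances" section.
tokens = ['Morning', '5km', 'run', 'with', '{{USERNAME}}', 'for', 'breast', 'cancer',
          'awareness', '#', 'pinkoctober', '#', 'breastcancerawareness', '#', 'zalorafit',
          '#', 'zalorafitxbnwrc', '@', 'The', 'Central', 'Park', ',', 'Desa', 'Parkcity',
          '{{URL}}']
tags = [14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 2, 14, 2, 14, 14, 14, 14, 14, 14,
        4, 11, 11, 11, 11, 14]


def extract_entities(tokens, tags):
    """Group BIO tags into (entity_type, text) spans (hypothetical helper)."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tags):
        label = id2label[tag_id]
        if label.startswith("B-"):
            # A new span starts; close the previous one if any.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            # Continuation of the open span.
            current_tokens.append(token)
        else:
            # "O" or an I- tag without a matching open span: close and reset.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


entities = extract_entities(tokens, tags)
print(entities)
# [('event', 'pinkoctober'), ('event', 'breastcancerawareness'),
#  ('location', 'Central Park , Desa Parkcity')]
```

Tokens are joined with single spaces, so the span text reflects the tokenization rather than the original tweet spacing.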

Models

See the full evaluation metrics here.

Main Models

Model (link)	Data	Language Model	Micro F1 (2021)	Macro F1 (2021)
12311321	12312321	12313321	65.75	61.25
12314321	12312321	12316321	65.16	60.81
12317321	12312321	12319321	65.68	61
12320321	12312321	12322321	65.26	60.7
12323321	12312321	12325321	66.46	61.87
12326321	12312321	12328321	65.36	60.52
12329321	12312321	12331321	63.58	59
12332321	12312321	12334321	62.3	57.59
12335321	12312321	12313321	66.02	60.9
12338321	12312321	12316321	65.47	60.01
12341321	12312321	12319321	65.87	61.07
12344321	12312321	12322321	65.51	60.57
12347321	12312321	12325321	66.41	61.66
12350321	12312321	12328321	65.84	61.02
12353321	12312321	12331321	63.2	57.67
12356321	12312321	12313321	64.05	59.11
12359321	12312321	12316321	61.76	57
12362321	12312321	12322321	63.98	58.91
12365321	12312321	12325321	62.9	58.13
12368321	12312321	12328321	63.09	57.35
12371321	12312321	12331321	59.75	53.93
12374321	12312321	12334321	60.67	55.5
12377321	12312321	12313321	64.76	60
12380321	12312321	12316321	64.21	59.11
12383321	12312321	12319321	64.28	59.31
12386321	12312321	12322321	62.87	58.26
12389321	12312321	12325321	64.01	59.47
12392321	12312321	12328321	64.06	59.44
12395321	12312321	12331321	61.43	56.14
12398321	12312321	12334321	60.09	54.67

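Model descriptions follow.

Models with suffix -all: fine-tuned on train_all and validated on validation_2021.
Models with suffix -continuous: fine-tuned on train_2020, then further fine-tuned on train_2021, and validated on validation_2021.
Models with suffix -2021: fine-tuned only on train_2021 and validated on validation_2021.
Models with suffix -2020: fine-tuned only on train_2020 and validated on validation_2020.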
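The suffix convention for the main models can be captured as a lookup from suffix to training recipe. This is a minimal sketch under stated assumptions: `Recipe`, `RECIPES`, and `recipe_for` are hypothetical names, the model names in the usage lines are made up for illustration, and the -2020 recipe assumes training on train_2020, as the split naming indicates. Names containing a self-labeling marker such as -self2020 are deliberately skipped, because they also end in -all or -continuous but follow a different recipe.

```python
from typing import NamedTuple, Optional, Tuple


class Recipe(NamedTuple):
    train_stages: Tuple[str, ...]  # splits used for fine-tuning, in order
    validation: str                # split used for validation


RECIPES = {
    "-all": Recipe(("train_all",), "validation_2021"),
    "-continuous": Recipe(("train_2020", "train_2021"), "validation_2021"),
    "-2021": Recipe(("train_2021",), "validation_2021"),
    "-2020": Recipe(("train_2020",), "validation_2020"),
}


def recipe_for(model_name: str) -> Optional[Recipe]:
    """Return the main-model recipe implied by a name suffix, or None if not covered."""
    # Self-labeled ablation models reuse the -all / -continuous endings; skip them.
    if "-self20" in model_name:
        return None
    for suffix, recipe in RECIPES.items():
        if model_name.endswith(suffix):
            return recipe
    return None


# Hypothetical model names, used only to exercise the lookup.
print(recipe_for("example-model-continuous"))
print(recipe_for("example-model-2020-self2020-all"))  # None: not a main model
```

Keeping the stages as an ordered tuple makes the difference between -all (one pass over the combined data) and -continuous (two sequential passes) explicit.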

Sub Models (used in the ablation study)

Models fine-tuned only on train_random and validated on validation_2020.

Model (link)	Data	Language Model	Micro F1 (2021)	Macro F1 (2021)
123101321	12312321	12313321	66.33	60.96
123104321	12312321	12319321	63.29	58.5
123107321	12312321	12316321	64.04	59.23
123110321	12312321	12322321	64.72	59.97
123113321	12312321	12325321	64.86	60.49
123116321	12312321	12328321	65.55	59.58
123119321	12312321	12331321	62.39	57.54
123122321	12312321	12334321	60.91	55.92

Models fine-tuned on the self-labeled dataset built from extra_{2020,2021} and validated on validation_2020.

Model (link)	Data	Language Model	Micro F1 (2021)	Macro F1 (2021)
123125321	12312321	12313321	64.56	59.63
123128321	12312321	12313321	64.6	59.45
123131321	12312321	12313321	65.46	60.39
123134321	12312321	12313321	64.52	59.45
123137321	12312321	12313321	65.15	60.23
123140321	12312321	12313321	64.48	59.41

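Model descriptions follow.

Models with suffix -self2020: fine-tuned on the self-annotated data of the extra_2020 split of tweetner7.
Models with suffix -self2021: fine-tuned on the self-annotated data of the extra_2021 split of tweetner7.
Models with suffix -2020-self2020-all: fine-tuned on the self-annotated data of the extra_2020 split of tweetner7, using the combined training dataset of extra_2020 and train_2020.
Models with suffix -2020-self2021-all: fine-tuned on the self-annotated data of the extra_2021 split of tweetner7, using the combined training dataset of extra_2021 and train_2020.
Models with suffix -2020-self2020-continuous: fine-tuned on the self-annotated data of the extra_2020 split of tweetner7. Fine-tuned on train_2020, then further fine-tuned on extra_2020.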
Models with suffix -2020-self2021-continuous: fine-tuned on the self-annotated data of the extra_2021 split of tweetner7. Fine-tuned on train_2020, then further fine-tuned on extra_2021.

Reproduce Experimental Results

To reproduce the experimental results in our AACL paper, please see the repository https://github.com/asahi417/tner/tree/master/examples/tweetner7_paper .

Citation Information

@inproceedings{ushio-etal-2022-tweet,
    title = "{N}amed {E}ntity {R}ecognition in {T}witter: {A} {D}ataset and {A}nalysis on {S}hort-{T}erm {T}emporal {S}hifts",
    author = "Ushio, Asahi  and
        Neves, Leonardo  and
        Silva, Vitor  and
        Barbieri, Francesco  and
        Camacho-Collados, Jose",
    booktitle = "The 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing",
    month = nov,
    year = "2022",
    address = "Online",
    publisher = "Association for Computational Linguistics",
}

Author:

tner

Dataset size:

87.47 MB