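Dataset:
tner/tweetner7
This is the official repository of TweetNER7 ("Named Entity Recognition in Twitter: A Dataset and Analysis on Short-Term Temporal Shifts", AACL main conference 2022), an NER dataset on Twitter with 7 entity labels. Each instance of TweetNER7 comes with a timestamp, ranging from September 2019 to August 2021. The tweet collection used in TweetNER7 is the same as the one used in TweetTopic. The dataset is also integrated in TweetNLP.
Before annotation, we pre-process the tweets to normalize some artifacts, converting URLs into the special token {{URL}} and non-verified usernames into {{USERNAME}}. For verified usernames, we wrap the display name (or account name) in the symbols {@ and @}. For example, a tweet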
Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from @herbiehancock via @bluenoterecords link below: http://bluenote.lnk.to/AlbumOfTheWeek
is transformed into the following text.
Get the all-analog Classic Vinyl Edition
of "Takin' Off" Album from {@herbiehancock@}
via {@bluenoterecords@} link below: {{URL}}
A simple function to format a tweet follows.
import re
from urlextract import URLExtract
extractor = URLExtract()
def format_tweet(tweet):
    # mask web urls
    urls = extractor.find_urls(tweet)
    for url in urls:
        tweet = tweet.replace(url, "{{URL}}")
    # format twitter account
    tweet = re.sub(r"\b(\s*)(@[\S]+)\b", r'\1{\2@}', tweet)
    return tweet
target = """Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from @herbiehancock via @bluenoterecords link below: http://bluenote.lnk.to/AlbumOfTheWeek"""
target_format = format_tweet(target)
print(target_format)
'Get the all-analog Classic Vinyl Edition of "Takin\' Off" Album from {@herbiehancock@} via {@bluenoterecords@} link below: {{URL}}'
We ask annotators to ignore these special tokens but to label mentions of verified users.
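The format_tweet function above wraps every mention in {@ and @} and does not distinguish verified from non-verified accounts. The following is a minimal sketch of the two-way masking described earlier, assuming the caller already knows which accounts are verified. The names `mask_mentions` and `verified`, as well as the mention regex, are hypothetical and are not part of the dataset's tooling.

```python
import re

# Assumption: a mention is "@" followed by letters, digits, or underscores,
# and is not preceded by a word character (which would indicate an e-mail address).
MENTION = re.compile(r"(?<!\w)@(\w+)")


def mask_mentions(tweet, verified):
    """Wrap verified mentions as {@name@} and mask all others as {{USERNAME}}.

    `verified` is a hypothetical set of lower-cased account names supplied
    by the caller; how it is obtained is outside the scope of this sketch.
    """
    def replace(match):
        name = match.group(1)
        if name.lower() in verified:
            return "{@" + name + "@}"
        return "{{USERNAME}}"

    return MENTION.sub(replace, tweet)


example = "thanks @herbiehancock and @someone_else!"
print(mask_mentions(example, {"herbiehancock"}))
```

Keeping the verified set as an explicit argument makes the function easy to test without querying any external service.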
| split | number of instances | description |
|---|---|---|
| train_2020 | 4616 | training dataset from September 2019 to August 2020 |
| train_2021 | 2495 | training dataset from September 2020 to August 2021 |
| train_all | 7111 | combined training dataset of train_2020 and train_2021 |
| validation_2020 | 576 | validation dataset from September 2019 to August 2020 |
| validation_2021 | 310 | validation dataset from September 2020 to August 2021 |
| test_2020 | 576 | test dataset from September 2019 to August 2020 |
| test_2021 | 2807 | test dataset from September 2020 to August 2021 |
| train_random | 4616 | randomly sampled training dataset with the same size as train_2020 from train_all |
| validation_random | 576 | randomly sampled validation dataset with the same size as validation_2020 from validation_all |
| extra_2020 | 87880 | extra tweets without annotations from September 2019 to August 2020 |
| extra_2021 | 93594 | extra tweets without annotations from September 2020 to August 2021 |
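For the temporal-shift setting, a model should be trained on train_2020, validated on validation_2020, and evaluated on test_2021. In the general setting, a model is trained on train_all, the most representative training set, validated on validation_2021, and evaluated on test_2021.
An example from train looks as follows.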
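The two settings can be written down as a small lookup from setting to split names. This is only a sketch: the setting keys `temporal_shift` and `general` and the helpers `splits_for` and `load_setting` are names chosen here for illustration. `load_setting` assumes that the Hugging Face `datasets` library is installed and that `load_dataset("tner/tweetner7")` exposes the split names listed in the table above; depending on the installed version, extra arguments may be required.

```python
# Split names are taken from the table above; the setting keys are
# labels invented for this sketch.
SETTINGS = {
    "temporal_shift": {
        "train": "train_2020",
        "validation": "validation_2020",
        "test": "test_2021",
    },
    "general": {
        "train": "train_all",
        "validation": "validation_2021",
        "test": "test_2021",
    },
}


def splits_for(setting):
    """Return the train/validation/test split names for a setting."""
    if setting not in SETTINGS:
        raise ValueError(f"unknown setting {setting!r}; expected one of {sorted(SETTINGS)}")
    return SETTINGS[setting]


def load_setting(setting, dataset_name="tner/tweetner7"):
    """Load the three splits of a setting (requires network access)."""
    # Imported lazily so that splits_for works without the library installed.
    from datasets import load_dataset

    dataset = load_dataset(dataset_name)
    return {role: dataset[split] for role, split in splits_for(setting).items()}


print(splits_for("temporal_shift"))
```

Both settings share test_2021 as the evaluation split, so their scores are directly comparable.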
{
'tokens': ['Morning', '5km', 'run', 'with', '{{USERNAME}}', 'for', 'breast', 'cancer', 'awareness', '#', 'pinkoctober', '#', 'breastcancerawareness', '#', 'zalorafit', '#', 'zalorafitxbnwrc', '@', 'The', 'Central', 'Park', ',', 'Desa', 'Parkcity', '{{URL}}'],
'tags': [14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 2, 14, 2, 14, 14, 14, 14, 14, 14, 4, 11, 11, 11, 11, 14],
'id': '1183344337016381440',
'date': '2019-10-13'
}
The label2id dictionary is as follows.
{
"B-corporation": 0,
"B-creative_work": 1,
"B-event": 2,
"B-group": 3,
"B-location": 4,
"B-person": 5,
"B-product": 6,
"I-corporation": 7,
"I-creative_work": 8,
"I-event": 9,
"I-group": 10,
"I-location": 11,
"I-person": 12,
"I-product": 13,
"O": 14
}
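To read the integer tags of an instance, the mapping above can be inverted and the BIO labels grouped into entity spans. The following is a minimal sketch using the train example shown earlier; `extract_entities` is a hypothetical helper written for illustration, not a function of the tner library.

```python
LABEL2ID = {
    "B-corporation": 0, "B-creative_work": 1, "B-event": 2, "B-group": 3,
    "B-location": 4, "B-person": 5, "B-product": 6,
    "I-corporation": 7, "I-creative_work": 8, "I-event": 9, "I-group": 10,
    "I-location": 11, "I-person": 12, "I-product": 13,
    "O": 14,
}
ID2LABEL = {index: label for label, index in LABEL2ID.items()}


def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []

    def flush():
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        label = ID2LABEL[tag]
        if label.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and label[2:] == current_type:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" or an I- tag with no matching open entity closes the span;
            # such a stray I- token is dropped in this sketch.
            flush()
            current_type, current_tokens = None, []
    flush()
    return entities


tokens = ['Morning', '5km', 'run', 'with', '{{USERNAME}}', 'for', 'breast', 'cancer',
          'awareness', '#', 'pinkoctober', '#', 'breastcancerawareness', '#', 'zalorafit',
          '#', 'zalorafitxbnwrc', '@', 'The', 'Central', 'Park', ',', 'Desa', 'Parkcity',
          '{{URL}}']
tags = [14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 2, 14, 2, 14, 14, 14, 14, 14, 14,
        4, 11, 11, 11, 11, 14]

for entity_type, text in extract_entities(tokens, tags):
    print(entity_type, "->", text)
```

For this example the sketch yields two event entities (pinkoctober and breastcancerawareness) and one location entity covering the tokens Central, Park, the comma, Desa, and Parkcity.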
See here for the full evaluation metrics.
| Model (link) | Data | Language Model | Micro F1 (2021) | Macro F1 (2021) |
|---|---|---|---|---|
| 12311321 | 12312321 | 12313321 | 65.75 | 61.25 |
| 12314321 | 12312321 | 12316321 | 65.16 | 60.81 |
| 12317321 | 12312321 | 12319321 | 65.68 | 61 |
| 12320321 | 12312321 | 12322321 | 65.26 | 60.7 |
| 12323321 | 12312321 | 12325321 | 66.46 | 61.87 |
| 12326321 | 12312321 | 12328321 | 65.36 | 60.52 |
| 12329321 | 12312321 | 12331321 | 63.58 | 59 |
| 12332321 | 12312321 | 12334321 | 62.3 | 57.59 |
| 12335321 | 12312321 | 12313321 | 66.02 | 60.9 |
| 12338321 | 12312321 | 12316321 | 65.47 | 60.01 |
| 12341321 | 12312321 | 12319321 | 65.87 | 61.07 |
| 12344321 | 12312321 | 12322321 | 65.51 | 60.57 |
| 12347321 | 12312321 | 12325321 | 66.41 | 61.66 |
| 12350321 | 12312321 | 12328321 | 65.84 | 61.02 |
| 12353321 | 12312321 | 12331321 | 63.2 | 57.67 |
| 12356321 | 12312321 | 12313321 | 64.05 | 59.11 |
| 12359321 | 12312321 | 12316321 | 61.76 | 57 |
| 12362321 | 12312321 | 12322321 | 63.98 | 58.91 |
| 12365321 | 12312321 | 12325321 | 62.9 | 58.13 |
| 12368321 | 12312321 | 12328321 | 63.09 | 57.35 |
| 12371321 | 12312321 | 12331321 | 59.75 | 53.93 |
| 12374321 | 12312321 | 12334321 | 60.67 | 55.5 |
| 12377321 | 12312321 | 12313321 | 64.76 | 60 |
| 12380321 | 12312321 | 12316321 | 64.21 | 59.11 |
| 12383321 | 12312321 | 12319321 | 64.28 | 59.31 |
| 12386321 | 12312321 | 12322321 | 62.87 | 58.26 |
| 12389321 | 12312321 | 12325321 | 64.01 | 59.47 |
| 12392321 | 12312321 | 12328321 | 64.06 | 59.44 |
| 12395321 | 12312321 | 12331321 | 61.43 | 56.14 |
| 12398321 | 12312321 | 12334321 | 60.09 | 54.67 |
Model descriptions follow below.
| Model (link) | Data | Language Model | Micro F1 (2021) | Macro F1 (2021) |
|---|---|---|---|---|
| 123101321 | 12312321 | 12313321 | 66.33 | 60.96 |
| 123104321 | 12312321 | 12319321 | 63.29 | 58.5 |
| 123107321 | 12312321 | 12316321 | 64.04 | 59.23 |
| 123110321 | 12312321 | 12322321 | 64.72 | 59.97 |
| 123113321 | 12312321 | 12325321 | 64.86 | 60.49 |
| 123116321 | 12312321 | 12328321 | 65.55 | 59.58 |
| 123119321 | 12312321 | 12331321 | 62.39 | 57.54 |
| 123122321 | 12312321 | 12334321 | 60.91 | 55.92 |
| Model (link) | Data | Language Model | Micro F1 (2021) | Macro F1 (2021) |
|---|---|---|---|---|
| 123125321 | 12312321 | 12313321 | 64.56 | 59.63 |
| 123128321 | 12312321 | 12313321 | 64.6 | 59.45 |
| 123131321 | 12312321 | 12313321 | 65.46 | 60.39 |
| 123134321 | 12312321 | 12313321 | 64.52 | 59.45 |
| 123137321 | 12312321 | 12313321 | 65.15 | 60.23 |
| 123140321 | 12312321 | 12313321 | 64.48 | 59.41 |
Model descriptions follow below.
To reproduce the experimental results in our AACL paper, please see the repository https://github.com/asahi417/tner/tree/master/examples/tweetner7_paper .
@inproceedings{ushio-etal-2022-tweet,
title = "{N}amed {E}ntity {R}ecognition in {T}witter: {A} {D}ataset and {A}nalysis on {S}hort-{T}erm {T}emporal {S}hifts",
author = "Ushio, Asahi and
Neves, Leonardo and
Silva, Vitor and
Barbieri, Francesco and
Camacho-Collados, Jose",
booktitle = "The 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing",
month = nov,
year = "2022",
address = "Online",
publisher = "Association for Computational Linguistics",
}