Dataset:
tner/tweetner7
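This is the official repository of TweetNER7 ("Named Entity Recognition in Twitter: A Dataset and Analysis on Short-Term Temporal Shifts", AACL 2022 main conference), a Twitter NER dataset with 7 entity labels. Each TweetNER7 instance has a timestamp, ranging from September 2019 to August 2021. The tweet collection used in TweetNER7 is the same as the one used in TweetTopic. The dataset is also integrated into TweetNLP.
Before annotation, we pre-process the tweets to normalize some artifacts: URLs are converted into the special token {{URL}}, and non-verified usernames are converted into {{USERNAME}}. For verified usernames, we wrap the display name (or account name) in the symbols {@ and @}. For example, a tweet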
Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from @herbiehancock via @bluenoterecords link below: http://bluenote.lnk.to/AlbumOfTheWeek
is transformed into the following text.
Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from {@herbiehancock@} via {@bluenoterecords@} link below: {{URL}}
A simple function to format tweets follows.

    import re

    from urlextract import URLExtract

    extractor = URLExtract()


    def format_tweet(tweet):
        # Mask web URLs with the special token {{URL}}.
        urls = extractor.find_urls(tweet)
        for url in urls:
            tweet = tweet.replace(url, "{{URL}}")
        # Wrap Twitter account mentions as {@account@}.
        tweet = re.sub(r"\b(\s*)(@[\S]+)\b", r"\1{\2@}", tweet)
        return tweet


    target = """Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from @herbiehancock via @bluenoterecords link below: http://bluenote.lnk.to/AlbumOfTheWeek"""
    target_format = format_tweet(target)
    print(target_format)
    # Get the all-analog Classic Vinyl Edition of "Takin' Off" Album from {@herbiehancock@} via {@bluenoterecords@} link below: {{URL}}

We ask annotators to ignore these special tokens, but to label mentions of verified users.
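The pre-processing above leaves three kinds of special tokens in the text. As a minimal sketch (not part of the official tooling), the hypothetical helper `special_token_kind` below tells them apart. It assumes that each special token survives tokenization as a single whitespace-free token, which is an assumption rather than something stated here.

```python
import re

# Pattern for a verified-account mention such as {@herbiehancock@}.
VERIFIED_MENTION = re.compile(r"^\{@\S+@\}$")


def special_token_kind(token):
    """Return the kind of special token, or None for an ordinary token."""
    if token == "{{URL}}":
        return "url"
    if token == "{{USERNAME}}":
        return "username"
    if VERIFIED_MENTION.match(token):
        return "verified_mention"
    return None


# Hypothetical token list, assumed to be already split on whitespace.
tokens = ["via", "{@bluenoterecords@}", "link", "below:", "{{URL}}"]
print([special_token_kind(t) for t in tokens])
# [None, 'verified_mention', None, None, 'url']
```

Under the annotation guideline, only tokens of kind `verified_mention` are candidates for an entity label; `url` and `username` tokens are ignored.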
split | number of instances | description |
---|---|---|
train_2020 | 4616 | training dataset from September 2019 to August 2020 |
train_2021 | 2495 | training dataset from September 2020 to August 2021 |
train_all | 7111 | combined training dataset of train_2020 and train_2021 |
validation_2020 | 576 | validation dataset from September 2019 to August 2020 |
validation_2021 | 310 | validation dataset from September 2020 to August 2021 |
test_2020 | 576 | test dataset from September 2019 to August 2020 |
test_2021 | 2807 | test dataset from September 2020 to August 2021 |
train_random | 4616 | randomly sampled training dataset with the same size as train_2020 from train_all |
validation_random | 576 | randomly sampled validation dataset with the same size as validation_2020 from validation_all |
extra_2020 | 87880 | extra tweets without annotations from September 2019 to August 2020 |
extra_2021 | 93594 | extra tweets without annotations from September 2020 to August 2021 |
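The splits are defined by two collection periods. The sketch below maps an instance date to its period and checks that the training split sizes in the table add up. The helper `period_of` and the exact first and last day of each period are assumptions made for illustration; the table only gives month-level ranges.

```python
from datetime import date

# Inclusive (start, end) of each period; day-level boundaries are assumed.
PERIODS = {
    "2020": (date(2019, 9, 1), date(2020, 8, 31)),
    "2021": (date(2020, 9, 1), date(2021, 8, 31)),
}

# Instance counts copied from the table above.
SPLIT_SIZES = {"train_2020": 4616, "train_2021": 2495, "train_all": 7111}


def period_of(date_string):
    """Map an ISO date such as '2019-10-13' to '2020' or '2021' (None if outside both)."""
    day = date.fromisoformat(date_string)
    for name, (start, end) in PERIODS.items():
        if start <= day <= end:
            return name
    return None


print(period_of("2019-10-13"))  # 2020
# train_all is the union of the two yearly training splits.
print(SPLIT_SIZES["train_2020"] + SPLIT_SIZES["train_2021"] == SPLIT_SIZES["train_all"])  # True
```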
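For the temporal-shift setting, models should be trained on train_2020, validated on validation_2020, and evaluated on test_2021. In the general setting, models are trained on train_all (the most representative training set), validated on validation_2021, and evaluated on test_2021.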
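The two settings can be written down as a mapping from role to split name, as in the sketch below. The names `SETTINGS` and `load_setting` are hypothetical. `load_setting` relies on `datasets.load_dataset` and needs network access; whether the Hub repository loads without extra arguments depends on the installed `datasets` version, so treat that part as an assumption.

```python
# Split names for the two evaluation settings described above.
SETTINGS = {
    "temporal_shift": {
        "train": "train_2020",
        "validation": "validation_2020",
        "test": "test_2021",
    },
    "general": {
        "train": "train_all",
        "validation": "validation_2021",
        "test": "test_2021",
    },
}


def load_setting(setting):
    """Load the three splits of one setting (requires `datasets` and network access)."""
    # Imported lazily so that the mapping above is usable without the library.
    from datasets import load_dataset

    return {
        role: load_dataset("tner/tweetner7", split=split)
        for role, split in SETTINGS[setting].items()
    }


print(SETTINGS["temporal_shift"])
```

Both settings share test_2021 as the test split, so their scores are directly comparable; only the training and validation data differ.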
An example instance from the train split looks as follows.

    {
        'tokens': ['Morning', '5km', 'run', 'with', '{{USERNAME}}', 'for', 'breast', 'cancer', 'awareness', '#', 'pinkoctober', '#', 'breastcancerawareness', '#', 'zalorafit', '#', 'zalorafitxbnwrc', '@', 'The', 'Central', 'Park', ',', 'Desa', 'Parkcity', '{{URL}}'],
        'tags': [14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 2, 14, 2, 14, 14, 14, 14, 14, 14, 4, 11, 11, 11, 11, 14],
        'id': '1183344337016381440',
        'date': '2019-10-13'
    }

The label2id dictionary can be found here.

    {
        "B-corporation": 0,
        "B-creative_work": 1,
        "B-event": 2,
        "B-group": 3,
        "B-location": 4,
        "B-person": 5,
        "B-product": 6,
        "I-corporation": 7,
        "I-creative_work": 8,
        "I-event": 9,
        "I-group": 10,
        "I-location": 11,
        "I-person": 12,
        "I-product": 13,
        "O": 14
    }

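To read the integer tags of an instance, invert label2id and group the B-/I- tags into entity spans. The following is a minimal sketch using the example instance and the dictionary above. The function `extract_entities` is a hypothetical helper, and its lenient handling of an I- tag that does not continue the current entity (it starts a new one) is a design choice made here, not the behavior of any official script.

```python
LABEL2ID = {
    "B-corporation": 0, "B-creative_work": 1, "B-event": 2, "B-group": 3,
    "B-location": 4, "B-person": 5, "B-product": 6, "I-corporation": 7,
    "I-creative_work": 8, "I-event": 9, "I-group": 10, "I-location": 11,
    "I-person": 12, "I-product": 13, "O": 14,
}
ID2LABEL = {index: label for label, index in LABEL2ID.items()}


def extract_entities(tokens, tags):
    """Group BIO tag ids into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []

    def flush():
        # Store the entity collected so far, if any.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, tag_id in zip(tokens, tags):
        prefix, _, entity_type = ID2LABEL[tag_id].partition("-")
        if prefix == "I" and entity_type == current_type:
            current_tokens.append(token)  # continue the open entity
        elif prefix in ("B", "I"):
            flush()  # B- tag, or a stray I- tag: start a new entity
            current_type, current_tokens = entity_type, [token]
        else:
            flush()  # "O": close any open entity
            current_type, current_tokens = None, []
    flush()
    return entities


tokens = ['Morning', '5km', 'run', 'with', '{{USERNAME}}', 'for', 'breast',
          'cancer', 'awareness', '#', 'pinkoctober', '#',
          'breastcancerawareness', '#', 'zalorafit', '#', 'zalorafitxbnwrc',
          '@', 'The', 'Central', 'Park', ',', 'Desa', 'Parkcity', '{{URL}}']
tags = [14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 2, 14, 2, 14, 14, 14, 14, 14,
        14, 4, 11, 11, 11, 11, 14]

for entity in extract_entities(tokens, tags):
    print(entity)
# ('event', 'pinkoctober')
# ('event', 'breastcancerawareness')
# ('location', 'Central Park , Desa Parkcity')
```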
See the full evaluation metrics here.
Model (link) | Data | Language Model | Micro F1 (2021) | Macro F1 (2021) |
---|---|---|---|---|
12311321 | 12312321 | 12313321 | 65.75 | 61.25 |
12314321 | 12312321 | 12316321 | 65.16 | 60.81 |
12317321 | 12312321 | 12319321 | 65.68 | 61 |
12320321 | 12312321 | 12322321 | 65.26 | 60.7 |
12323321 | 12312321 | 12325321 | 66.46 | 61.87 |
12326321 | 12312321 | 12328321 | 65.36 | 60.52 |
12329321 | 12312321 | 12331321 | 63.58 | 59 |
12332321 | 12312321 | 12334321 | 62.3 | 57.59 |
12335321 | 12312321 | 12313321 | 66.02 | 60.9 |
12338321 | 12312321 | 12316321 | 65.47 | 60.01 |
12341321 | 12312321 | 12319321 | 65.87 | 61.07 |
12344321 | 12312321 | 12322321 | 65.51 | 60.57 |
12347321 | 12312321 | 12325321 | 66.41 | 61.66 |
12350321 | 12312321 | 12328321 | 65.84 | 61.02 |
12353321 | 12312321 | 12331321 | 63.2 | 57.67 |
12356321 | 12312321 | 12313321 | 64.05 | 59.11 |
12359321 | 12312321 | 12316321 | 61.76 | 57 |
12362321 | 12312321 | 12322321 | 63.98 | 58.91 |
12365321 | 12312321 | 12325321 | 62.9 | 58.13 |
12368321 | 12312321 | 12328321 | 63.09 | 57.35 |
12371321 | 12312321 | 12331321 | 59.75 | 53.93 |
12374321 | 12312321 | 12334321 | 60.67 | 55.5 |
12377321 | 12312321 | 12313321 | 64.76 | 60 |
12380321 | 12312321 | 12316321 | 64.21 | 59.11 |
12383321 | 12312321 | 12319321 | 64.28 | 59.31 |
12386321 | 12312321 | 12322321 | 62.87 | 58.26 |
12389321 | 12312321 | 12325321 | 64.01 | 59.47 |
12392321 | 12312321 | 12328321 | 64.06 | 59.44 |
12395321 | 12312321 | 12331321 | 61.43 | 56.14 |
12398321 | 12312321 | 12334321 | 60.09 | 54.67 |
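The tables report both micro and macro F1. As a reminder of how the two differ, the sketch below computes them from per-type true-positive, false-positive, and false-negative counts. The counts are toy values invented for illustration, not results from this dataset, and it is an assumption that the reported scores follow this standard entity-level definition (expressed there as percentages).

```python
def f1(tp, fp, fn):
    """F1 from raw counts; equivalent to 2PR / (P + R)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f1(counts):
    """counts maps entity type -> (tp, fp, fn); returns (micro, macro)."""
    # Macro: average the per-type F1 scores, so each type weighs the same.
    per_type = [f1(*c) for c in counts.values()]
    macro = sum(per_type) / len(per_type)
    # Micro: pool the counts first, so frequent types dominate.
    totals = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = f1(*totals)
    return micro, macro


# Toy counts (hypothetical): one frequent, well-predicted type and one rare, hard type.
toy_counts = {"person": (8, 2, 2), "event": (1, 1, 3)}
micro, macro = micro_macro_f1(toy_counts)
print(round(micro, 4), round(macro, 4))  # 0.6923 0.5667
```

A macro score below the micro score, as in every row of the tables, is the pattern expected when rarer entity types are harder to predict than frequent ones.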
Model descriptions follow.
Model (link) | Data | Language Model | Micro F1 (2021) | Macro F1 (2021) |
---|---|---|---|---|
123101321 | 12312321 | 12313321 | 66.33 | 60.96 |
123104321 | 12312321 | 12319321 | 63.29 | 58.5 |
123107321 | 12312321 | 12316321 | 64.04 | 59.23 |
123110321 | 12312321 | 12322321 | 64.72 | 59.97 |
123113321 | 12312321 | 12325321 | 64.86 | 60.49 |
123116321 | 12312321 | 12328321 | 65.55 | 59.58 |
123119321 | 12312321 | 12331321 | 62.39 | 57.54 |
123122321 | 12312321 | 12334321 | 60.91 | 55.92 |
Model (link) | Data | Language Model | Micro F1 (2021) | Macro F1 (2021) |
---|---|---|---|---|
123125321 | 12312321 | 12313321 | 64.56 | 59.63 |
123128321 | 12312321 | 12313321 | 64.6 | 59.45 |
123131321 | 12312321 | 12313321 | 65.46 | 60.39 |
123134321 | 12312321 | 12313321 | 64.52 | 59.45 |
123137321 | 12312321 | 12313321 | 65.15 | 60.23 |
123140321 | 12312321 | 12313321 | 64.48 | 59.41 |
Model descriptions follow.
To reproduce the experimental results in our AACL paper, please see the repository https://github.com/asahi417/tner/tree/master/examples/tweetner7_paper .

    @inproceedings{ushio-etal-2022-tweet,
        title = "{N}amed {E}ntity {R}ecognition in {T}witter: {A} {D}ataset and {A}nalysis on {S}hort-{T}erm {T}emporal {S}hifts",
        author = "Ushio, Asahi and Neves, Leonardo and Silva, Vitor and Barbieri, Francesco and Camacho-Collados, Jose",
        booktitle = "The 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing",
        month = nov,
        year = "2022",
        address = "Online",
        publisher = "Association for Computational Linguistics",
    }