Model:

cardiffnlp/roberta-large-tweet-topic-multi-all

English

This model is a fine-tuned version of roberta-large on the tweet_topic_multi dataset. It was fine-tuned on the train_all split of tweet_topic and validated on the test_2021 split. The fine-tuning script can be found here. The model achieves the following results on the test_2021 set (a sketch of how these multi-label metrics are computed follows the list):

  • F1 (micro): 0.7631035905901775
  • F1 (macro): 0.6202570779365779
  • Accuracy: 0.5366289458010721
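
As a point of reference, here is a minimal sketch of how such multi-label metrics are typically computed with scikit-learn. This is not the card's evaluation script, and the y_true/y_pred arrays are hypothetical placeholders; note that for multi-label inputs, scikit-learn's accuracy_score reports exact-match (subset) accuracy, which is why it is lower than both F1 scores.

import numpy as np
from sklearn.metrics import f1_score, accuracy_score

# Hypothetical binary indicator matrices: one row per tweet, one column per topic.
y_true = np.array([[1, 0, 1], [0, 1, 0]])  # placeholder gold labels
y_pred = np.array([[1, 0, 0], [0, 1, 0]])  # placeholder model predictions

print("F1 (micro):", f1_score(y_true, y_pred, average="micro"))
print("F1 (macro):", f1_score(y_true, y_pred, average="macro"))
# With multi-label input, accuracy_score counts a prediction as correct
# only if every label flag matches the gold row exactly.
print("Accuracy:", accuracy_score(y_true, y_pred))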

Usage

import math
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def sigmoid(x):
  # Convert a raw logit into an independent per-label probability.
  return 1 / (1 + math.exp(-x))
  
tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/roberta-large-tweet-topic-multi-all")
model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/roberta-large-tweet-topic-multi-all", problem_type="multi_label_classification")
model.eval()
class_mapping = model.config.id2label  # maps label index -> topic name

with torch.no_grad():
  text = "#NewVideo Cray Dollas- Water- Ft. Charlie Rose- (Official Music Video)- {{URL}} via {@YouTube@} #watchandlearn {{USERNAME}}"
  tokens = tokenizer(text, return_tensors='pt')
  output = model(**tokens)
  # Multi-label classification: sigmoid each logit independently and keep
  # every topic whose probability exceeds 0.5.
  flags = [sigmoid(s) > 0.5 for s in output[0][0].detach().tolist()]
  topic = [class_mapping[n] for n, i in enumerate(flags) if i]
print(topic)
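
For scoring many tweets at once, the per-example loop can be replaced by a vectorized pass over a padded batch. The snippet below is an assumed variant, not part of the original card (the example texts are hypothetical); it reuses the model, tokenizer, and class_mapping from above:

texts = [
  "I just watched the new Marvel movie, the soundtrack is amazing",  # hypothetical examples
  "Our team won the championship game last night {{URL}}",
]
with torch.no_grad():
  batch = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
  probs = torch.sigmoid(model(**batch).logits)  # shape: (batch_size, num_labels)
for text, flags in zip(texts, (probs > 0.5).tolist()):
  topics = [class_mapping[i] for i, f in enumerate(flags) if f]
  print(text, '->', topics)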

Reference

@inproceedings{dimosthenis-etal-2022-twitter,
    title = "{T}witter {T}opic {C}lassification",
    author = "Antypas, Dimosthenis  and
    Ushio, Asahi  and
    Camacho-Collados, Jose  and
    Neves, Leonardo  and
    Silva, Vitor  and
    Barbieri, Francesco",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics"
}