Model:
cardiffnlp/roberta-large-tweet-topic-multi-2020
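This model is a version of roberta-large fine-tuned on tweet_topic_multi. It was fine-tuned on the train_2020 split and validated on the test_2021 split. The fine-tuning script can be found here. The model achieves the following results on the test_2021 set: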
import math
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def sigmoid(x):
    # Map a raw logit to an independent per-label probability.
    return 1 / (1 + math.exp(-x))

tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/roberta-large-tweet-topic-multi-2020")
model = AutoModelForSequenceClassification.from_pretrained(
    "cardiffnlp/roberta-large-tweet-topic-multi-2020",
    problem_type="multi_label_classification",
)
model.eval()
class_mapping = model.config.id2label  # label index -> topic name

with torch.no_grad():
    text = "#NewVideo Cray Dollas- Water- Ft. Charlie Rose- (Official Music Video)- {{URL}} via {@YouTube@} #watchandlearn {{USERNAME}}"
    tokens = tokenizer(text, return_tensors='pt')
    output = model(**tokens)

# output[0] holds the logits; take the first (only) item in the batch.
flags = [sigmoid(s) > 0.5 for s in output[0][0].detach().tolist()]
# Keep every topic whose probability exceeds 0.5 (there may be several, or none).
topic = [class_mapping[n] for n, i in enumerate(flags) if i]
print(topic)
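The decoding step in the example above (sigmoid per label, then a 0.5 cut-off) can be tried in isolation without downloading the model. The following is a minimal sketch under stated assumptions: `logits_to_topics` is a hypothetical helper, and the label names and logit values are made up for illustration rather than taken from `model.config.id2label`. The sigmoid is written in a numerically stable form, which is a small departure from the one-line version in the card.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function (avoids overflow for large |x|)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def logits_to_topics(logits, id2label, threshold=0.5):
    """Return (label, probability) pairs whose probability exceeds threshold.

    Each label is scored independently, so zero, one, or many labels
    can be returned for a single input.
    """
    probs = [sigmoid(s) for s in logits]
    return [(id2label[i], p) for i, p in enumerate(probs) if p > threshold]


# Hypothetical label mapping and logits, for illustration only.
example_id2label = {0: "music", 1: "sports", 2: "celebrity_&_pop_culture"}
example_logits = [2.0, -1.5, 0.3]

for label, prob in logits_to_topics(example_logits, example_id2label):
    print(f"{label}: {prob:.3f}")
```

With the real model, the same helper could be called with `output[0][0].tolist()` and `model.config.id2label`. Raising or lowering `threshold` trades precision against recall; 0.5 is the value used in the card's example.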
@inproceedings{dimosthenis-etal-2022-twitter,
    title = "{T}witter {T}opic {C}lassification",
    author = "Antypas, Dimosthenis and Ushio, Asahi and Camacho-Collados, Jose and Neves, Leonardo and Silva, Vitor and Barbieri, Francesco",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics"
}