tweet-topic-19-multi

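This is a RoBERTa-base model trained on roughly 90 million tweets posted up to the end of 2019 (see here) and fine-tuned for multi-label topic classification on a corpus of 11,267 tweets (see tweets). The original RoBERTa-base model can be found here, and the original reference paper is TweetEval. The model is intended for English text.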

Reference papers: TimeLMs paper, TweetTopic.
Git repository: TimeLMs official repository.

Labels:

0: arts_&_culture	5: fashion_&_style	10: learning_&_educational	15: science_&_technology
1: business_&_entrepreneurs	6: film_tv_&_video	11: music	16: sports
2: celebrity_&_pop_culture	7: fitness_&_health	12: news_&_social_concern	17: travel_&_adventure
3: diaries_&_daily_life	8: food_&_dining	13: other_hobbies	18: youth_&_student_life
4: family	9: gaming	14: relationships
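
The label table above can also be written as a plain Python mapping, which is convenient for decoding a prediction vector without loading the model. This is a minimal sketch: the names `ID2LABEL` and `labels_from_multi_hot` are illustrative and not part of the model's API, and `model.config.id2label` remains the authoritative source for the mapping.

```python
# Label ids and names copied from the table above.
ID2LABEL = {
    0: "arts_&_culture",
    1: "business_&_entrepreneurs",
    2: "celebrity_&_pop_culture",
    3: "diaries_&_daily_life",
    4: "family",
    5: "fashion_&_style",
    6: "film_tv_&_video",
    7: "fitness_&_health",
    8: "food_&_dining",
    9: "gaming",
    10: "learning_&_educational",
    11: "music",
    12: "news_&_social_concern",
    13: "other_hobbies",
    14: "relationships",
    15: "science_&_technology",
    16: "sports",
    17: "travel_&_adventure",
    18: "youth_&_student_life",
}


def labels_from_multi_hot(predictions):
    """Return the label names whose entry in a 0/1 prediction vector is set."""
    if len(predictions) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} entries, got {len(predictions)}")
    return [ID2LABEL[i] for i, flag in enumerate(predictions) if flag]


# Hypothetical prediction vector with ids 12 and 16 switched on.
example = [0] * len(ID2LABEL)
example[12] = 1
example[16] = 1
print(labels_from_multi_hot(example))  # ['news_&_social_concern', 'sports']
```

Because the task is multi-label, any number of entries (including none) may be set in the vector.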

Full classification example

from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import expit

# Model identifier on the Hugging Face Hub
MODEL = "cardiffnlp/tweet-topic-19-multi"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# PyTorch
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
class_mapping = model.config.id2label

text = "It is great to see athletes promoting awareness for climate change."
tokens = tokenizer(text, return_tensors='pt')
output = model(**tokens)

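# Take the logits for the single input, turn each one into an
# independent probability with a sigmoid, and threshold at 0.5
scores = output[0][0].detach().numpy()
scores = expit(scores)
predictions = (scores >= 0.5) * 1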

# TensorFlow (uncomment to use instead of the PyTorch path above)
#tf_model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
#class_mapping = tf_model.config.id2label
#text = "It is great to see athletes promoting awareness for climate change."
#tokens = tokenizer(text, return_tensors='tf')
#output = tf_model(**tokens)
#scores = output[0][0]
#scores = expit(scores)
#predictions = (scores >= 0.5) * 1

# Map the predicted indices to class names
for i in range(len(predictions)):
    if predictions[i]:
        print(class_mapping[i])

Output:

news_&_social_concern
sports
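
The example above treats each topic independently: every logit goes through a sigmoid and a label is kept when its score reaches 0.5, which is the same as keeping every label whose raw logit is non-negative. The sketch below reproduces that decision rule in pure Python so the threshold can be experimented with in isolation. The helper names, the three-label toy mapping, and the logits are made up for illustration and are not real model output.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode_multilabel(logits, id2label, threshold=0.5):
    """Return (label, score) pairs with score >= threshold, highest score first."""
    scored = [(id2label[i], sigmoid(v)) for i, v in enumerate(logits)]
    kept = [(label, score) for label, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy inputs (assumed values, not produced by the model).
toy_id2label = {0: "music", 1: "news_&_social_concern", 2: "sports"}
toy_logits = [-2.0, 0.4, 1.5]

result = decode_multilabel(toy_logits, toy_id2label)
print([label for label, _ in result])  # ['sports', 'news_&_social_concern']
```

Raising `threshold` trades recall for precision; with `threshold=0.7` only `sports` would survive for these toy logits.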

Author:

Cardiff NLP

Dataset size:

953.93 MB