RoBERTweet is a BERT-based, cased model pre-trained on all Romanian tweets posted between 2008 and 2022.
```python
# Install the `emoji` dependency required by the normalization script first:
# pip install emoji

import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("Iulian277/ro-bert-tweet")
model = AutoModel.from_pretrained("Iulian277/ro-bert-tweet")

# Sanitize the input using the `normalize` function from the `normalize.py` script
from normalize import normalize
normalized_text = normalize("Salut, ce faci?")

# Tokenize the sentence and run it through the model
input_ids = torch.tensor(tokenizer.encode(normalized_text, add_special_tokens=True)).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)

# The last hidden state is the first element of the output tuple
last_hidden_states = outputs[0]
```
Always normalize the input text with the `normalize.py` script included in the repository before passing it to the tokenizer; otherwise, performance will degrade because of `[UNK]` tokens.
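To illustrate why such a step matters, here is a minimal sketch of the kind of preprocessing a tweet normalizer typically performs, masking user mentions and URLs so they map to tokens seen during pre-training. This is a hypothetical illustration only; the actual `normalize.py` in the repository (which also relies on the `emoji` package) may behave differently.

```python
import re

def normalize_sketch(text: str) -> str:
    """Hypothetical tweet-normalization sketch (NOT the repository's
    normalize.py): replace user mentions and URLs with placeholder
    tokens so rare strings do not become [UNK] tokens."""
    text = re.sub(r"@\w+", "@user", text)         # mask user mentions
    text = re.sub(r"https?://\S+", "http", text)  # mask URLs
    return text.strip()

print(normalize_sketch("@ion Salut! Vezi https://example.com"))
```

Without a step like this, raw handles and links often fall outside the tokenizer's vocabulary and are encoded as `[UNK]`, degrading downstream performance.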
We would like to thank the TPU Research Cloud for providing the TPU compute needed to pre-train the RoBERTweet model.