Model:
Musixmatch/umberto-commoncrawl-cased-v1
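UmBERTo is a RoBERTa-based language model trained on large Italian corpora. It uses two innovative approaches: SentencePiece and Whole Word Masking. It is now available at github.com/huggingface/transformers.
Marco Lodola, Monument to Umberto Eco, Alessandria 2019
UmBERTo-Commoncrawl-Cased uses the Italian subcorpus of OSCAR as the training set for the language model. We used the deduplicated version of the Italian corpus, which consists of 70 GB of plain text, 210M sentences, and 11B words. The sentences have been filtered and shuffled for use in NLP research.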
Model | WWM | Cased | Tokenizer | Vocab Size | Train Steps | Download |
---|---|---|---|---|---|---|
umberto-commoncrawl-cased-v1 | YES | YES | SPM | 32K | 125k | 1236321 |
This model was trained with SentencePiece and Whole Word Masking.
These results refer to the umberto-commoncrawl-cased model. All details are on the official UmBERTo page.
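To make the idea of Whole Word Masking over SentencePiece tokens concrete, here is a minimal, simplified sketch. It is not the training code used for UmBERTo. The example segmentation is written by hand, and `group_into_words`, `whole_word_mask`, and the `mask_prob` value are hypothetical names and settings chosen for illustration. The only convention relied on is that SentencePiece marks the start of a word with the `▁` (U+2581) character.

```python
import random

WORD_START = "\u2581"  # SentencePiece prefixes word-initial pieces with this character


def group_into_words(pieces):
    """Return lists of piece indices, one list per whole word."""
    words = []
    for i, piece in enumerate(pieces):
        if piece.startswith(WORD_START) or not words:
            words.append([i])      # a new word starts here
        else:
            words[-1].append(i)    # continuation of the previous word
    return words


def whole_word_mask(pieces, mask_prob=0.15, mask_token="<mask>", seed=None):
    """Mask every piece of a randomly selected word, never a single piece of it.

    Simplification: real masked-language-model training usually also leaves
    some selected tokens unchanged or swaps them for random tokens.
    """
    rng = random.Random(seed)
    masked = list(pieces)
    for word in group_into_words(pieces):
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked


# Hand-written, hypothetical segmentation (not produced by the real tokenizer).
pieces = [WORD_START + w for w in ["Umberto", "Eco", "è", "stato", "un", "grande", "scri"]]
pieces.append("ttore")  # continuation piece: no word-start marker

print(group_into_words(pieces))
print(whole_word_mask(pieces, mask_prob=0.5, seed=0))
```

In this sketch the two pieces of the last word are always masked or kept together, which is the difference from masking each subword piece independently.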
Named Entity Recognition (NER)

Dataset | F1 | Precision | Recall | Accuracy |
---|---|---|---|---|
ICAB-EvalITA07 | 87.565 | 86.596 | 88.556 | 98.690 |
WikiNER-ITA | 92.531 | 92.509 | 92.553 | 99.136 |
Part of Speech (POS)

Dataset | F1 | Precision | Recall | Accuracy |
---|---|---|---|---|
UD_Italian-ISDT | 98.870 | 98.861 | 98.879 | 98.977 |
UD_Italian-ParTUT | 98.786 | 98.812 | 98.760 | 98.903 |
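As a quick consistency check on the tables above, F1 is the harmonic mean of precision and recall, $F_1 = 2PR / (P + R)$. The sketch below recomputes F1 from the reported precision and recall values. The numbers are copied from the tables; the function name `f1_score` and the dictionary layout are only illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (reported F1, precision, recall), copied from the tables above.
reported = {
    "ICAB-EvalITA07": (87.565, 86.596, 88.556),
    "WikiNER-ITA": (92.531, 92.509, 92.553),
    "UD_Italian-ISDT": (98.870, 98.861, 98.879),
    "UD_Italian-ParTUT": (98.786, 98.812, 98.760),
}

for name, (f1, p, r) in reported.items():
    print(f"{name}: reported F1 = {f1:.3f}, recomputed F1 = {f1_score(p, r):.3f}")
```

The recomputed values agree with the reported F1 scores to three decimal places.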
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")
    umberto = AutoModel.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")

    encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
    input_ids = torch.tensor(encoded_input).unsqueeze(0)  # Batch size 1
    outputs = umberto(input_ids)
    last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output

Predict masked token:
    from transformers import pipeline

    fill_mask = pipeline(
        "fill-mask",
        model="Musixmatch/umberto-commoncrawl-cased-v1",
        tokenizer="Musixmatch/umberto-commoncrawl-cased-v1"
    )

    result = fill_mask("Umberto Eco è <mask> un grande scrittore")
    # {'sequence': '<s> Umberto Eco è considerato un grande scrittore</s>', 'score': 0.18599839508533478, 'token': 5032}
    # {'sequence': '<s> Umberto Eco è stato un grande scrittore</s>', 'score': 0.17816807329654694, 'token': 471}
    # {'sequence': '<s> Umberto Eco è sicuramente un grande scrittore</s>', 'score': 0.16565583646297455, 'token': 2654}
    # {'sequence': '<s> Umberto Eco è indubbiamente un grande scrittore</s>', 'score': 0.0932890921831131, 'token': 17908}
    # {'sequence': '<s> Umberto Eco è certamente un grande scrittore</s>', 'score': 0.054701317101716995, 'token': 5269}
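The fill-mask call above returns a list of candidate dictionaries. The following sketch shows one way to post-process such a list, assuming it has exactly the shape shown in the comments above (other library versions may return different keys or omit the `<s>`/`</s>` markers). The helper names `strip_special_tokens` and `top_candidates` and the `min_score` threshold are hypothetical, and the data is copied from the example output rather than produced by running the model.

```python
# Example output copied from the fill-mask comments above (not recomputed here).
results = [
    {"sequence": "<s> Umberto Eco è considerato un grande scrittore</s>", "score": 0.18599839508533478, "token": 5032},
    {"sequence": "<s> Umberto Eco è stato un grande scrittore</s>", "score": 0.17816807329654694, "token": 471},
    {"sequence": "<s> Umberto Eco è sicuramente un grande scrittore</s>", "score": 0.16565583646297455, "token": 2654},
    {"sequence": "<s> Umberto Eco è indubbiamente un grande scrittore</s>", "score": 0.0932890921831131, "token": 17908},
    {"sequence": "<s> Umberto Eco è certamente un grande scrittore</s>", "score": 0.054701317101716995, "token": 5269},
]


def strip_special_tokens(sequence):
    """Remove the <s> and </s> sentence markers from a predicted sequence."""
    return sequence.replace("<s>", "").replace("</s>", "").strip()


def top_candidates(results, min_score=0.1):
    """Return (clean sentence, score) pairs with score >= min_score, best first."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [
        (strip_special_tokens(r["sequence"]), r["score"])
        for r in ranked
        if r["score"] >= min_score
    ]


for sentence, score in top_candidates(results):
    print(f"{score:.3f}  {sentence}")
```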
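All of the original datasets are publicly available or were released with the owners' permission. The datasets are all released under a CC0 or CC BY license.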
    @inproceedings{magnini2006annotazione,
      title = {Annotazione di contenuti concettuali in un corpus italiano: I-CAB},
      author = {Magnini, Bernardo and Cappelli, Amedeo and Pianta, Emanuele and Speranza, Manuela and Bartalesi Lenzi, V and Sprugnoli, Rachele and Romano, Lorenza and Girardi, Christian and Negri, Matteo},
      booktitle = {Proc. of SILFI 2006},
      year = {2006}
    }

    @inproceedings{magnini2006cab,
      title = {I-CAB: the Italian Content Annotation Bank.},
      author = {Magnini, Bernardo and Pianta, Emanuele and Girardi, Christian and Negri, Matteo and Romano, Lorenza and Speranza, Manuela and Lenzi, Valentina Bartalesi and Sprugnoli, Rachele},
      booktitle = {LREC},
      pages = {963--968},
      year = {2006},
      organization = {Citeseer}
    }
- Loreto Parisi: loreto at musixmatch dot com, loretoparisi
- Simone Francia: simone.francia at musixmatch dot com, simonefrancia
- Paolo Magnani: paul.magnani95 at gmail dot com, paulthemagno
We do machine learning and artificial intelligence research at Musixmatch AI. Follow us (musixmatch) on Twitter and GitHub.