UmBERTo Wikipedia Uncased

UmBERTo is a RoBERTa-based language model trained on large Italian corpora. It uses two innovative approaches: SentencePiece and Whole Word Masking. It is now available at github.com/huggingface/transformers.

Marco Lodola, Monument to Umberto Eco, Alessandria 2019

Dataset

UmBERTo-Wikipedia-Uncased is trained on a relatively small corpus (~7GB) extracted from Wikipedia-ITA.

Pre-trained model

Model	WWM	Cased	Tokenizer	Vocab Size	Train Steps	Download
umberto-wikipedia-uncased-v1	YES	NO	SPM	32K	100k	1236321

This model was trained with SentencePiece and Whole Word Masking.
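
To make the idea of Whole Word Masking concrete, here is a minimal sketch of how it could work on SentencePiece output. It is not the training code used for UmBERTo: the helper names, the example segmentation, and the `mask_prob` default of 0.15 are all assumptions for illustration, and the RoBERTa-style random/keep replacement of selected tokens is left out. It relies only on the SentencePiece convention that a piece starting with "▁" opens a new word.

```python
import random

SPM_WORD_START = "\u2581"  # "▁", the marker SentencePiece puts at the start of a word


def group_pieces_into_words(pieces):
    """Return one list of piece indices per whole word."""
    words = []
    for i, piece in enumerate(pieces):
        if piece.startswith(SPM_WORD_START) or not words:
            words.append([i])      # this piece opens a new word
        else:
            words[-1].append(i)    # this piece continues the previous word
    return words


def whole_word_mask(pieces, mask_prob=0.15, mask_token="<mask>", seed=None):
    """Mask whole words: if a word is selected, every one of its pieces is masked.

    mask_prob is an illustrative value, not a setting taken from the model card.
    """
    rng = random.Random(seed)
    masked = list(pieces)
    for word in group_pieces_into_words(pieces):
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked


# Hypothetical segmentation; the real tokenizer may split the sentence differently.
pieces = ["▁umberto", "▁eco", "▁è", "▁stato", "▁un", "▁grande", "▁scri", "ttore"]
print(group_pieces_into_words(pieces))
# [[0], [1], [2], [3], [4], [5], [6, 7]]
print(whole_word_mask(pieces, mask_prob=0.5, seed=0))
```

The point of grouping first is that "▁scri" and "ttore" are always masked or kept together, so the model never sees half of a word as a hint for the other half.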

Downstream Tasks

These results refer to the umberto-wikipedia-uncased model. For all details, see the official UmBERTo page.

Named Entity Recognition (NER)

Dataset	F1	Precision	Recall	Accuracy
ICAB-EvalITA07	86.240	85.939	86.544	98.534
WikiNER-ITA	90.483	90.328	90.638	98.661

Part of Speech (POS)

Dataset	F1	Precision	Recall	Accuracy
UD_Italian-ISDT	98.563	98.508	98.618	98.717
UD_Italian-ParTUT	97.810	97.835	97.784	98.060
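
As a quick sanity check on the tables above, the F1 column can be recomputed from the Precision and Recall columns. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; the model card does not state which averaging scheme was used, so this is a consistency check rather than a description of the evaluation script. The function name `f1_score` is our own.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the NER and POS tables.
reported = {
    "ICAB-EvalITA07": (85.939, 86.544, 86.240),
    "WikiNER-ITA": (90.328, 90.638, 90.483),
    "UD_Italian-ISDT": (98.508, 98.618, 98.563),
    "UD_Italian-ParTUT": (97.835, 97.784, 97.810),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.3f}, reported F1 = {f1:.3f}")
```

The recomputed values agree with the reported ones up to rounding in the last digit, which is expected because the precision and recall figures are themselves rounded.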

Usage

Load UmBERTo Wikipedia Uncased with AutoModel and AutoTokenizer:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")
umberto = AutoModel.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")

encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
input_ids = torch.tensor(encoded_input).unsqueeze(0)  # Batch size 1
outputs = umberto(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output

Predict the masked token:

from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="Musixmatch/umberto-wikipedia-uncased-v1",
    tokenizer="Musixmatch/umberto-wikipedia-uncased-v1"
)

result = fill_mask("Umberto Eco è <mask> un grande scrittore")
# {'sequence': '<s> umberto eco è stato un grande scrittore</s>', 'score': 0.5784581303596497, 'token': 361}
# {'sequence': '<s> umberto eco è anche un grande scrittore</s>', 'score': 0.33813193440437317, 'token': 269}
# {'sequence': '<s> umberto eco è considerato un grande scrittore</s>', 'score': 0.027196012437343597, 'token': 3236}
# {'sequence': '<s> umberto eco è diventato un grande scrittore</s>', 'score': 0.013716378249228, 'token': 5742}
# {'sequence': '<s> umberto eco è inoltre un grande scrittore</s>', 'score': 0.010662357322871685, 'token': 1030}
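
The candidates listed above come back as a list of dictionaries, so they can be post-processed with plain Python. The following sketch keeps only confident candidates and strips the `<s>` and `</s>` markers. The helper names and the `min_score` threshold are hypothetical, the sample data is copied from the output shown above, and the exact output format may differ between versions of the library.

```python
def clean_sequence(sequence):
    """Remove the <s> and </s> markers and surrounding whitespace."""
    return sequence.replace("<s>", "").replace("</s>", "").strip()


def filter_predictions(results, min_score=0.1):
    """Return (sequence, score) pairs with score >= min_score, best first."""
    kept = [r for r in results if r["score"] >= min_score]
    kept.sort(key=lambda r: r["score"], reverse=True)
    return [(clean_sequence(r["sequence"]), r["score"]) for r in kept]


# Sample data copied from the fill-mask output above (first three candidates).
sample_results = [
    {"sequence": "<s> umberto eco è stato un grande scrittore</s>",
     "score": 0.5784581303596497, "token": 361},
    {"sequence": "<s> umberto eco è anche un grande scrittore</s>",
     "score": 0.33813193440437317, "token": 269},
    {"sequence": "<s> umberto eco è considerato un grande scrittore</s>",
     "score": 0.027196012437343597, "token": 3236},
]

for sequence, score in filter_predictions(sample_results):
    print(f"{score:.3f}  {sequence}")
```

With the default threshold only the "stato" and "anche" candidates survive; lowering `min_score` to 0.0 keeps every candidate.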

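Citations

All of the original datasets are publicly available or were released with the owners' permission. The datasets are all released under a CC0 or CC BY license.

UD Italian-ISDT Dataset: Github
UD Italian-ParTUT Dataset: Github
I-CAB (Italian Content Annotation Bank), EvalITA: Page
WIKINER: Page, Paper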

@inproceedings {magnini2006annotazione,
    title = {Annotazione di contenuti concettuali in un corpus italiano: I - CAB},
    author = {Magnini,Bernardo and Cappelli,Amedeo and Pianta,Emanuele and Speranza,Manuela and Bartalesi Lenzi,V and Sprugnoli,Rachele and Romano,Lorenza and Girardi,Christian and Negri,Matteo},
    booktitle = {Proc.of SILFI 2006},
    year = {2006}
}
@inproceedings {magnini2006cab,
    title = {I - CAB: the Italian Content Annotation Bank.},
    author = {Magnini,Bernardo and Pianta,Emanuele and Girardi,Christian and Negri,Matteo and Romano,Lorenza and Speranza,Manuela and Lenzi,Valentina Bartalesi and Sprugnoli,Rachele},
    booktitle = {LREC},
    pages = {963--968},
    year = {2006},
    organization = {Citeseer}
}

Authors

Loreto Parisi: loreto at musixmatch dot com, loretoparisi
Simone Francia: simone.francia at musixmatch dot com, simonefrancia
Paolo Magnani: paul.magnani95 at gmail dot com, paulthemagno

About Musixmatch AI

We do Machine Learning and Artificial Intelligence @musixmatch. Follow us on Twitter and Github.

Author:

Musixmatch

Dataset size:

428.07 MB