UmBERTo Commoncrawl Cased

UmBERTo is a RoBERTa-based language model trained on large Italian corpora. It uses two innovative approaches: SentencePiece and Whole Word Masking. It is now available at github.com/huggingface/transformers.

Marco Lodola, Monument to Umberto Eco, Alessandria 2019

Dataset

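UmBERTo-Commoncrawl-Cased uses the Italian subcorpus of OSCAR as the training set for the language model. We used the deduplicated version of the Italian corpus, which contains 70 GB of plain text, 210M sentences, and 11B words. The sentences have been filtered and shuffled so that they can be used for NLP research.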

Pre-trained model

Model	WWM	Cased	Tokenizer	Vocab Size	Train Steps	Download
umberto-commoncrawl-cased-v1	YES	YES	SPM	32K	125k	1236321

This model was trained with SentencePiece and Whole Word Masking.
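
To make the idea of Whole Word Masking over SentencePiece pieces concrete, here is a minimal sketch, not the training code used for UmBERTo. It relies on the SentencePiece convention that a piece starting with "▁" (U+2581) opens a new word. The helper names, the example segmentation, and the 15% masking rate are assumptions made for illustration, and the sketch leaves out the random-replacement and keep-original cases that BERT-style masking normally includes.

```python
import random

WORD_START = "\u2581"  # SentencePiece marks the start of a word with this character


def group_into_words(pieces):
    """Group subword pieces into lists of indices, one list per whole word."""
    words = []
    for i, piece in enumerate(pieces):
        if piece.startswith(WORD_START) or not words:
            words.append([i])      # a new word begins here
        else:
            words[-1].append(i)    # continuation of the previous word
    return words


def whole_word_mask(pieces, mask_prob=0.15, mask_token="<mask>", seed=None):
    """Mask whole words: either every piece of a word is masked, or none is."""
    rng = random.Random(seed)
    masked = list(pieces)
    for word in group_into_words(pieces):
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked


# Hypothetical segmentation; the real UmBERTo vocabulary may split differently.
pieces = [
    WORD_START + "Umberto", WORD_START + "Eco", WORD_START + "è",
    WORD_START + "stato", WORD_START + "un", WORD_START + "grande",
    WORD_START + "scritt", "ore",
]
print(group_into_words(pieces))
print(whole_word_mask(pieces, mask_prob=0.5, seed=0))
```

The point of the grouping step is that the two pieces of the last word are always masked together, so the model cannot recover a masked piece simply by looking at the rest of the same word.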

Downstream tasks

These results refer to the umberto-commoncrawl-cased model. All details are on the official UmBERTo page.

Named Entity Recognition (NER)

Dataset	F1	Precision	Recall	Accuracy
ICAB-EvalITA07	87.565	86.596	88.556	98.690
WikiNER-ITA	92.531	92.509	92.553	99.136

Part of Speech (POS)

Dataset	F1	Precision	Recall	Accuracy
UD_Italian-ISDT	98.870	98.861	98.879	98.977
UD_Italian-ParTUT	98.786	98.812	98.760	98.903
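
As a quick sanity check on the tables above, the sketch below recomputes F1 from the reported precision and recall, assuming F1 is their harmonic mean. The function name `f1_score` is ours, and the numbers are copied from the tables; the evaluation scripts themselves are not shown here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (reported F1, precision, recall), copied from the NER and POS tables
reported = {
    "ICAB-EvalITA07": (87.565, 86.596, 88.556),
    "WikiNER-ITA": (92.531, 92.509, 92.553),
    "UD_Italian-ISDT": (98.870, 98.861, 98.879),
    "UD_Italian-ParTUT": (98.786, 98.812, 98.760),
}

for name, (f1, p, r) in reported.items():
    print(f"{name}: reported F1 = {f1:.3f}, recomputed F1 = {f1_score(p, r):.3f}")
```

Under that assumption, the recomputed values agree with the reported F1 scores to three decimal places.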

Usage

Load UmBERTo with AutoModel and AutoTokenizer:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")
umberto = AutoModel.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")

encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
input_ids = torch.tensor(encoded_input).unsqueeze(0)  # Batch size 1
outputs = umberto(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output

Predict the masked token:

from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="Musixmatch/umberto-commoncrawl-cased-v1",
    tokenizer="Musixmatch/umberto-commoncrawl-cased-v1"
)

result = fill_mask("Umberto Eco è <mask> un grande scrittore")
# {'sequence': '<s> Umberto Eco è considerato un grande scrittore</s>', 'score': 0.18599839508533478, 'token': 5032}
# {'sequence': '<s> Umberto Eco è stato un grande scrittore</s>', 'score': 0.17816807329654694, 'token': 471}
# {'sequence': '<s> Umberto Eco è sicuramente un grande scrittore</s>', 'score': 0.16565583646297455, 'token': 2654}
# {'sequence': '<s> Umberto Eco è indubbiamente un grande scrittore</s>', 'score': 0.0932890921831131, 'token': 17908}
# {'sequence': '<s> Umberto Eco è certamente un grande scrittore</s>', 'score': 0.054701317101716995, 'token': 5269}
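
The predictions listed above come back as a list of dictionaries, so they are easy to post-process. The following is a small sketch under the assumption that each entry has the `sequence` and `score` keys shown in the comments; the helper names `clean_sequence` and `top_predictions` are hypothetical, and newer versions of transformers may already return sequences without the `<s>` and `</s>` delimiters.

```python
def clean_sequence(sequence, bos="<s>", eos="</s>"):
    """Strip the sentence delimiters from a predicted sequence, if present."""
    text = sequence.strip()
    if text.startswith(bos):
        text = text[len(bos):]
    if text.endswith(eos):
        text = text[:-len(eos)]
    return text.strip()


def top_predictions(result, k=3):
    """Return the k highest-scoring (clean sentence, rounded score) pairs."""
    ranked = sorted(result, key=lambda p: p["score"], reverse=True)
    return [(clean_sequence(p["sequence"]), round(p["score"], 3)) for p in ranked[:k]]


# Two of the predictions shown above, copied verbatim and deliberately out of order.
sample = [
    {"sequence": "<s> Umberto Eco è stato un grande scrittore</s>",
     "score": 0.17816807329654694, "token": 471},
    {"sequence": "<s> Umberto Eco è considerato un grande scrittore</s>",
     "score": 0.18599839508533478, "token": 5032},
]
print(top_predictions(sample, k=1))
```

With the real pipeline, `top_predictions(result)` would be called on the list returned by `fill_mask`.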

Citation

All of the original datasets are publicly available or were released with the owners' permission. They are all released under a CC0 or CC BY license.

UD Italian-ISDT Dataset Github
UD Italian-ParTUT Dataset Github
I-CAB (Italian Content Annotation Bank), EvalITA Page
WIKINER Page, Paper

@inproceedings{magnini2006annotazione,
    title = {Annotazione di contenuti concettuali in un corpus italiano: I-CAB},
    author = {Magnini, Bernardo and Cappelli, Amedeo and Pianta, Emanuele and Speranza, Manuela and Bartalesi Lenzi, V and Sprugnoli, Rachele and Romano, Lorenza and Girardi, Christian and Negri, Matteo},
    booktitle = {Proc. of SILFI 2006},
    year = {2006}
}
@inproceedings{magnini2006cab,
    title = {I-CAB: the Italian Content Annotation Bank.},
    author = {Magnini, Bernardo and Pianta, Emanuele and Girardi, Christian and Negri, Matteo and Romano, Lorenza and Speranza, Manuela and Lenzi, Valentina Bartalesi and Sprugnoli, Rachele},
    booktitle = {LREC},
    pages = {963--968},
    year = {2006},
    organization = {Citeseer}
}

Authors

Loreto Parisi: loreto at musixmatch dot com, loretoparisi
Simone Francia: simone.francia at musixmatch dot com, simonefrancia
Paolo Magnani: paul.magnani95 at gmail dot com, paulthemagno

About Musixmatch AI

We do Machine Learning and Artificial Intelligence research at Musixmatch AI. Follow us (musixmatch) on Twitter and Github.

Author:

Musixmatch

Dataset size:

426.78 MB