Model: readerbench/RoBERT-large

RoBERT-large Model Card

Language:

  • ro

RoBERT-large

A pretrained BERT model for Romanian

This model was pretrained on Romanian with the masked language modeling (MLM) and next sentence prediction (NSP) objectives. It was introduced in this paper. Three BERT models were released: RoBERT-small, RoBERT-base, and RoBERT-large, all of them uncased.

The columns give the parameter count (Weights), number of layers (L), hidden size (H), number of attention heads (A), and pretraining accuracies.

| Model        | Weights | L  | H    | A  | MLM accuracy | NSP accuracy |
|--------------|---------|----|------|----|--------------|--------------|
| RoBERT-small | 19M     | 12 | 256  | 8  | 0.5363       | 0.9687       |
| RoBERT-base  | 114M    | 12 | 768  | 12 | 0.6511       | 0.9802       |
| RoBERT-large | 341M    | 24 | 1024 | 24 | 0.6929       | 0.9843       |

All models are available: RoBERT-small, RoBERT-base, and RoBERT-large.

How to use
# tensorflow
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-large")
model = TFAutoModel.from_pretrained("readerbench/RoBERT-large")
# "exemplu de propoziție" = "example sentence" in Romanian
inputs = tokenizer("exemplu de propoziție", return_tensors="tf")
outputs = model(inputs)

# pytorch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-large")
model = AutoModel.from_pretrained("readerbench/RoBERT-large")
inputs = tokenizer("exemplu de propoziție", return_tensors="pt")
outputs = model(**inputs)
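
Because the checkpoint was pretrained with MLM, it can also be queried directly through the fill-mask pipeline. A minimal sketch (the example phrase and top_k value are ours, and it assumes the checkpoint ships its MLM head):

# fill-mask (masked language modeling)
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="readerbench/RoBERT-large")
# BERT-style checkpoints use [MASK] as the mask token.
print(fill_mask("exemplu de [MASK]", top_k=3))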

Training data

The model was trained on the following compilation of corpora. Note that the statistics below were computed after the cleaning process.

| Corpus | Words | Sentences | Size (GB) |
|--------|-------|-----------|-----------|
| Oscar  | 1.78B | 87M       | 10.8      |
| RoTex  | 240M  | 14M       | 1.5       |
| RoWiki | 50M   | 2M        | 0.3       |
| Total  | 2.07B | 103M      | 12.6      |

Downstream performance

Sentiment analysis

We report macro-averaged F1 scores (in %); a minimal fine-tuning sketch follows the results table.

| Model             | Dev   | Test  |
|-------------------|-------|-------|
| multilingual-BERT | 68.96 | 69.57 |
| XLM-R-base        | 71.26 | 71.71 |
| BERT-base-ro      | 70.49 | 71.02 |
| RoBERT-small      | 66.32 | 66.37 |
| RoBERT-base       | 70.89 | 71.61 |
| RoBERT-large      | 72.48 | 72.11 |
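
The card does not include the fine-tuning code behind these numbers. Below is a hypothetical sketch of the evaluation setup, assuming a binary positive/negative labeling and scikit-learn for the metric; the example sentences and labels are placeholders, not data from the paper.

# hypothetical sketch, not the authors' training code
import torch
from sklearn.metrics import f1_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-large")
# num_labels=2 is an assumption; the classification head is freshly
# initialized and only becomes meaningful after fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained(
    "readerbench/RoBERT-large", num_labels=2)

texts = ["Filmul a fost minunat!", "Nu mi-a plăcut deloc."]  # placeholders
labels = [1, 0]  # 1 = positive, 0 = negative

with torch.no_grad():
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    preds = model(**batch).logits.argmax(dim=-1).tolist()

# Macro-averaged F1 is the unweighted mean of per-class F1 scores,
# so both classes count equally regardless of their frequency.
print(f1_score(labels, preds, average="macro") * 100)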

Moldavian vs. Romanian dialect and cross-dialect topic identification

We report results on the VarDial 2019 Moldavian vs. Romanian cross-dialect topic identification challenge, as macro-averaged F1 scores (in %). In the two cross-dialect columns, models are trained on one dialect and evaluated on the other (e.g., MD to RO: trained on Moldavian, tested on Romanian).

| Model             | Dialect classification | MD to RO | RO to MD |
|-------------------|------------------------|----------|----------|
| 2-CNN + SVM       | 93.40                  | 65.09    | 75.21    |
| Char+Word SVM     | 96.20                  | 69.08    | 81.93    |
| BiGRU             | 93.30                  | 70.10    | 80.30    |
| multilingual-BERT | 95.34                  | 68.76    | 78.24    |
| XLM-R-base        | 96.28                  | 69.93    | 82.28    |
| BERT-base-ro      | 96.20                  | 69.93    | 78.79    |
| RoBERT-small      | 95.67                  | 69.01    | 80.40    |
| RoBERT-base       | 97.39                  | 68.30    | 81.09    |
| RoBERT-large      | 97.78                  | 69.91    | 83.65    |

Diacritics restoration

The challenge can be found here. We report results on the official test set, as accuracy (in %). The task is to restore the Romanian diacritics (ă, â, î, ș, ț) that were stripped from the input text; a small sketch of producing such undiacritized input follows the table.

| Model                       | Word level | Char level |
|-----------------------------|------------|------------|
| BiLSTM                      | 99.42      | -          |
| CharCNN                     | 98.40      | 99.65      |
| CharCNN + multilingual-BERT | 99.72      | 99.94      |
| CharCNN + XLM-R-base        | 99.76      | 99.95      |
| CharCNN + BERT-base-ro      | 99.79      | 99.95      |
| CharCNN + RoBERT-small      | 99.73      | 99.94      |
| CharCNN + RoBERT-base       | 99.78      | 99.95      |
| CharCNN + RoBERT-large      | 99.76      | 99.95      |
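
As an illustration of the task setup (our sketch, not the challenge's own pipeline): the input is Romanian text whose diacritics have been removed, and the system must put them back. Undiacritized input can be produced with Unicode decomposition:

# strip Romanian diacritics to produce the undiacritized model input
import unicodedata

def strip_diacritics(text: str) -> str:
    # NFD splits "ș" into "s" plus a combining comma below; dropping the
    # combining marks leaves the plain ASCII letter.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("mașina merge pe șosea"))  # -> masina merge pe sosea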

BibTeX entry and citation info

@inproceedings{masala2020robert,
  title={RoBERT--A Romanian BERT Model},
  author={Masala, Mihai and Ruseti, Stefan and Dascalu, Mihai},
  booktitle={Proceedings of the 28th International Conference on Computational Linguistics},
  pages={6626--6637},
  year={2020}
}