Model: readerbench/RoBERT-base

RoBERT-base Model Overview

Language:

  • ro

RoBERT-base

A pretrained BERT model for Romanian

A model pretrained on Romanian using the masked language modeling (MLM) and next sentence prediction (NSP) objectives. It was introduced in this paper. Three BERT models were released: RoBERT-small, RoBERT-base, and RoBERT-large, all of them uncased.
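To see the MLM objective in action, the fill-mask pipeline can probe the pretrained model directly; a minimal sketch (assuming the standard BERT [MASK] token; the example sentence is made up):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="readerbench/RoBERT-base")
# the model is uncased, so the input is lowercased
for prediction in fill_mask("capitala româniei este [MASK]."):
    print(prediction["token_str"], prediction["score"])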

Model         Parameters  L   H     A   MLM accuracy  NSP accuracy
RoBERT-small  19M         12  256   8   0.5363        0.9687
RoBERT-base   114M        12  768   12  0.6511        0.9802
RoBERT-large  341M        24  1024  24  0.6929        0.9843
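In the table, L is the number of Transformer layers, H the hidden size, and A the number of attention heads; as a sanity check, these can be read from the published config (a minimal sketch):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("readerbench/RoBERT-base")
print(config.num_hidden_layers)    # L: 12
print(config.hidden_size)          # H: 768
print(config.num_attention_heads)  # A: 12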

All models are available: RoBERT-small, RoBERT-base, RoBERT-large.

How to use
# tensorflow
from transformers import AutoTokenizer, TFAutoModel
tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-base")
model = TFAutoModel.from_pretrained("readerbench/RoBERT-base")
inputs = tokenizer("exemplu de propoziție", return_tensors="tf")
outputs = model(inputs)  # last_hidden_state, pooler_output

# pytorch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-base")
model = AutoModel.from_pretrained("readerbench/RoBERT-base")
inputs = tokenizer("exemplu de propoziție", return_tensors="pt")
outputs = model(**inputs)  # last_hidden_state, pooler_output
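The outputs above are per-token hidden states; one common way to get a single sentence vector is to mean-pool them over the attention mask. A minimal PyTorch sketch continuing the snippet above (a usage assumption, not from the original card):

# mean pooling over non-padding tokens (sketch)
import torch

with torch.no_grad():
    outputs = model(**inputs)
mask = inputs["attention_mask"].unsqueeze(-1)            # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
sentence_embedding = summed / mask.sum(dim=1)            # divide by real-token count
print(sentence_embedding.shape)                          # torch.Size([1, 768])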

Training data

The model was trained on a combination of the following corpora. Note that the statistics are reported after the corpora were cleaned.

Corpus  Words  Sentences  Size (GB)
Oscar   1.78B  87M        10.8
RoTex   240M   14M        1.5
RoWiki  50M    2M         0.3
Total   2.07B  103M       12.6

Downstream performance

Sentiment analysis

We report the macro-averaged F1 score (in %).

Model              Dev    Test
multilingual-BERT  68.96  69.57
XLM-R-base         71.26  71.71
BERT-base-ro       70.49  71.02
RoBERT-small       66.32  66.37
RoBERT-base        70.89  71.61
RoBERT-large       72.48  72.11
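For context, the macro-averaged F1 used throughout these tables averages the per-class F1 scores, weighting each class equally; a minimal scikit-learn sketch with made-up labels (not the original evaluation code):

from sklearn.metrics import f1_score

# hypothetical gold labels and predictions for a 3-class sentiment task
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(f1_score(y_true, y_pred, average="macro") * 100)  # mean of per-class F1, in %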

Moldavian vs. Romanian dialect and cross-dialect topic identification

We report macro-averaged F1 scores (in %) on the VarDial 2019 Moldavian vs. Romanian dialect identification challenge; MD to RO denotes cross-dialect topic identification trained on the Moldavian split and tested on the Romanian one, and RO to MD the reverse.

Model              Dialect Classification  MD to RO  RO to MD
2-CNN + SVM        93.40                   65.09     75.21
Char+Word SVM      96.20                   69.08     81.93
BiGRU              93.30                   70.10     80.30
multilingual-BERT  95.34                   68.76     78.24
XLM-R-base         96.28                   69.93     82.28
BERT-base-ro       96.20                   69.93     78.79
RoBERT-small       95.67                   69.01     80.40
RoBERT-base        97.39                   68.30     81.09
RoBERT-large       97.78                   69.91     83.65
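The scores above come from fine-tuning the pretrained checkpoints with a classification head; a minimal PyTorch sketch of one such fine-tuning step (hypothetical data and hyperparameters, not the authors' setup):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "readerbench/RoBERT-base", num_labels=2  # e.g. Moldavian vs. Romanian
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["exemplu de propoziție", "alt exemplu"]  # hypothetical training texts
labels = torch.tensor([0, 1])
batch = tokenizer(texts, padding=True, return_tensors="pt")

model.train()
loss = model(**batch, labels=labels).loss  # cross-entropy over the two classes
loss.backward()
optimizer.step()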

Diacritics restoration

The challenge is described here. We report accuracy (in %) on the official test set.

Model                        Word level  Char level
BiLSTM                       99.42       -
CharCNN                      98.40       99.65
CharCNN + multilingual-BERT  99.72       99.94
CharCNN + XLM-R-base         99.76       99.95
CharCNN + BERT-base-ro       99.79       99.95
CharCNN + RoBERT-small       99.73       99.94
CharCNN + RoBERT-base        99.78       99.95
CharCNN + RoBERT-large       99.76       99.95
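For context on the word- vs. char-level columns, Romanian diacritics restoration decides, for each ambiguous plain character, which diacritized variant to restore; a minimal sketch of that label space (an illustration only, not the CharCNN systems above):

# the Romanian diacritics arise from four plain letters (sketch)
CANDIDATES = {
    "a": ["a", "ă", "â"],
    "i": ["i", "î"],
    "s": ["s", "ș"],
    "t": ["t", "ț"],
}

def label_space(char):
    # possible restored forms; all other characters are unambiguous
    return CANDIDATES.get(char, [char])

print(label_space("a"))  # ['a', 'ă', 'â']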

BibTeX entry and citation info

@inproceedings{masala2020robert,
  title={RoBERT--A Romanian BERT Model},
  author={Masala, Mihai and Ruseti, Stefan and Dascalu, Mihai},
  booktitle={Proceedings of the 28th International Conference on Computational Linguistics},
  pages={6626--6637},
  year={2020}
}