
RoBERT-small Model Card

Language:

  • ro

RoBERT-small

Pretrained RoBERT-small model for Romanian

Pretrained model on Romanian, trained with masked language modeling (MLM) and next sentence prediction (NSP) objectives. It was introduced in this paper. Three BERT models were released: RoBERT-small, RoBERT-base, and RoBERT-large, all uncased.

Model        | Weights | L  | H    | A  | MLM accuracy | NSP accuracy
RoBERT-small | 19M     | 12 | 256  | 8  | 0.5363       | 0.9687
RoBERT-base  | 114M    | 12 | 768  | 12 | 0.6511       | 0.9802
RoBERT-large | 341M    | 24 | 1024 | 24 | 0.6929       | 0.9843

(L = number of layers, H = hidden size, A = number of attention heads.)
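
Since the models are pretrained with an MLM head, they can be queried directly for masked-token predictions. The following minimal sketch uses the transformers fill-mask pipeline; the example sentence is our own and not part of the original card.

# fill-mask sketch; the Romanian example sentence is ours (the model is uncased)
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="readerbench/RoBERT-small")
for pred in fill_mask("bucurești este [MASK] româniei."):
    print(pred["token_str"], round(pred["score"], 4))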

All models are available: RoBERT-small, RoBERT-base, and RoBERT-large.

Usage
# tensorflow
from transformers import AutoTokenizer, TFAutoModel
tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-small")
model = TFAutoModel.from_pretrained("readerbench/RoBERT-small")
# "exemplu de propoziție" = "example sentence" in Romanian
inputs = tokenizer("exemplu de propoziție", return_tensors="tf")
outputs = model(inputs)

# pytorch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-small")
model = AutoModel.from_pretrained("readerbench/RoBERT-small")
inputs = tokenizer("exemplu de propoziție", return_tensors="pt")
outputs = model(**inputs)
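
Both snippets return raw hidden states. A common follow-up, sketched here as our own suggestion rather than something the card prescribes, is to mean-pool the last hidden layer into a fixed-size sentence embedding (H = 256 for RoBERT-small), ignoring padding tokens:

# continues the PyTorch snippet above; mean pooling over non-padding
# tokens is our convention, not one mandated by the model card
import torch

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state        # (batch, seq_len, 256)
    mask = inputs["attention_mask"].unsqueeze(-1)     # (batch, seq_len, 1)
    embedding = (hidden * mask).sum(1) / mask.sum(1)  # (batch, 256)
print(embedding.shape)  # torch.Size([1, 256])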

Training data

The model was trained on the following corpora. Note that the statistics below are computed after the cleaning process.

Corpus | Words | Sentences | Size (GB)
Oscar  | 1.78B | 87M       | 10.8
RoTex  | 240M  | 14M       | 1.5
RoWiki | 50M   | 2M        | 0.3
Total  | 2.07B | 103M      | 12.6

Downstream performance

Sentiment analysis

We report macro-averaged F1 scores (in %).

Model             | Dev   | Test
multilingual-BERT | 68.96 | 69.57
XLM-R-base        | 71.26 | 71.71
BERT-base-ro      | 70.49 | 71.02
RoBERT-small      | 66.32 | 66.37
RoBERT-base       | 70.89 | 71.61
RoBERT-large      | 72.48 | 72.11
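
No fine-tuned checkpoints are released with this card, so reproducing numbers like the above requires task-specific training. The sketch below shows the standard transformers sequence-classification recipe; the toy texts, the binary label set, and the hyperparameters are placeholders of ours, not the paper's setup.

# Hypothetical fine-tuning sketch: dataset, label count and hyperparameters
# are placeholders, not the configuration used in the RoBERT paper.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-small")
model = AutoModelForSequenceClassification.from_pretrained(
    "readerbench/RoBERT-small", num_labels=2)  # binary sentiment assumed

texts = ["un film excelent", "o experiență dezamăgitoare"]  # toy examples
labels = [1, 0]  # 1 = positive, 0 = negative

class SentimentDataset(Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True,
                             return_tensors="pt")
        self.labels = torch.tensor(labels)
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item

loader = DataLoader(SentimentDataset(texts, labels), batch_size=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for batch in loader:  # one epoch over the toy data
    optimizer.zero_grad()
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()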

Moldavian vs. Romanian dialect and cross-dialect topic identification

We report results on the Moldavian vs. Romanian cross-dialect topic identification challenge, as macro-averaged F1 scores (in %).

Model             | Dialect classification | MD to RO | RO to MD
2-CNN + SVM       | 93.40                  | 65.09    | 75.21
Char+Word SVM     | 96.20                  | 69.08    | 81.93
BiGRU             | 93.30                  | 70.10    | 80.30
multilingual-BERT | 95.34                  | 68.76    | 78.24
XLM-R-base        | 96.28                  | 69.93    | 82.28
BERT-base-ro      | 96.20                  | 69.93    | 78.79
RoBERT-small      | 95.67                  | 69.01    | 80.40
RoBERT-base       | 97.39                  | 68.30    | 81.09
RoBERT-large      | 97.78                  | 69.91    | 83.65

Diacritics restoration

The challenge can be found here. We report results on the official test set, in terms of accuracy.

Model                       | Word level | Char level
BiLSTM                      | 99.42      | -
CharCNN                     | 98.40      | 99.65
CharCNN + multilingual-BERT | 99.72      | 99.94
CharCNN + XLM-R-base        | 99.76      | 99.95
CharCNN + BERT-base-ro      | 99.79      | 99.95
CharCNN + RoBERT-small      | 99.73      | 99.94
CharCNN + RoBERT-base       | 99.78      | 99.95
CharCNN + RoBERT-large      | 99.76      | 99.95
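
For readers unfamiliar with the task: Romanian diacritics restoration maps ASCII-flattened characters (a, i, s, t) back to their diacritized forms (ă/â, î, ș, ț). The toy framing below, as per-character classification, is entirely our illustration and not the paper's CharCNN + BERT architecture.

# Toy illustration of the task framing only; the systems in the table
# above are far more elaborate. A label picks which variant (if any)
# each ambiguous character should receive.
AMBIGUOUS = {"a": ["a", "ă", "â"], "i": ["i", "î"],
             "s": ["s", "ș"], "t": ["t", "ț"]}

def apply_labels(text, labels):
    """Restore diacritics given one class index per character."""
    return "".join(AMBIGUOUS[ch][lab] if ch in AMBIGUOUS else ch
                   for ch, lab in zip(text, labels))

# "propozitie" -> "propoziție"; in practice the labels come from a classifier
print(apply_labels("propozitie", [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]))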

BibTeX entry and citation info

@inproceedings{masala2020robert,
  title={RoBERT--A Romanian BERT Model},
  author={Masala, Mihai and Ruseti, Stefan and Dascalu, Mihai},
  booktitle={Proceedings of the 28th International Conference on Computational Linguistics},
  pages={6626--6637},
  year={2020}
}