Model:

dumitrescustefan/bert-base-romanian-uncased-v1


bert-base-romanian-uncased-v1

The BERT base, uncased model for Romanian, trained on a 15GB corpus, version v1.0.

How to use

from transformers import AutoTokenizer, AutoModel
import torch

# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1", do_lower_case=True)
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1")

# tokenize a sentence and run through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)

# get encoding
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
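If a single fixed-size sentence vector is needed, one common option is to average the token vectors of the last hidden states. The following is a minimal sketch only; mean pooling is our assumption here, not something prescribed by the model card:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1", do_lower_case=True)
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1")

# encode one sentence and run it through the model without tracking gradients
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)
with torch.no_grad():
    last_hidden_states = model(input_ids)[0]  # shape: (1, seq_len, 768)

# mean pooling (our assumption, not part of the model card): average the token vectors
sentence_embedding = last_hidden_states.mean(dim=1)  # shape: (1, 768)
print(sentence_embedding.shape)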

Remember to always sanitize your text! Replace the s and t cedilla letters with comma-below letters, like this:

text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")

because the model was not trained on cedilla s and t. If you skip this step, performance decreases due to <UNK> tokens and an increased number of tokens per word.
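As a quick illustration (the sanitize helper and the example sentence below are ours, not from the model card), tokenizing the same sentence with and without the replacement shows the extra or unknown tokens produced by cedilla letters:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1", do_lower_case=True)

def sanitize(text):
    # replace cedilla s/t with the comma-below letters the model was trained on
    return text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")

raw = "Aceştia sunt nişte şoricei."   # written with cedilla letters
clean = sanitize(raw)                 # written with comma-below letters

# the raw text typically yields [UNK]s or more subword pieces than the clean text
print(len(tokenizer.tokenize(raw)), tokenizer.tokenize(raw))
print(len(tokenizer.tokenize(clean)), tokenizer.tokenize(clean))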

Evaluation

Evaluation is performed on Universal Dependencies UPOS, XPOS, and LAS metrics, as well as on a named entity recognition (NER) task based on RONEC. Details and more in-depth tests are given on the dedicated evaluation page.

The baseline is the Multilingual BERT model bert-base-multilingual-(un)cased, as at the time of writing it was the only available BERT model that worked on Romanian.

Model                            UPOS    XPOS    NER     LAS
bert-base-multilingual-uncased   97.65   95.72   83.91   87.65
bert-base-romanian-uncased-v1    98.18   96.84   85.26   89.61

Corpus

The model is trained on the following corpora (the stats in the table below are after cleaning):

Corpus      Lines (M)   Words (M)   Chars (B)   Size (GB)
OPUS        55.05       635.04      4.045       3.8
OSCAR       33.56       1725.82     11.411      11
Wikipedia   1.54        60.47       0.411       0.4
Total       90.15       2421.33     15.867      15.2

Citation

If you use this model in a research paper, please cite the following paper:

Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324–4328, Online. Association for Computational Linguistics.

or, in BibTeX:

@inproceedings{dumitrescu-etal-2020-birth,
    title = "The birth of {R}omanian {BERT}",
    author = "Dumitrescu, Stefan  and
      Avram, Andrei-Marius  and
      Pyysalo, Sampo",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.findings-emnlp.387",
    doi = "10.18653/v1/2020.findings-emnlp.387",
    pages = "4324--4328",
}
Acknowledgements
  • We'd like to thank Sampo Pyysalo from TurkuNLP for helping us with the compute needed to pretrain the v1.0 BERT models. He's awesome!