This is the BERT base, uncased model for Romanian, trained on a 15 GB corpus, version v1.0.
```python
from transformers import AutoTokenizer, AutoModel
import torch

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1", do_lower_case=True)
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1")

# tokenize a sentence and run it through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # batch size 1
outputs = model(input_ids)

# get the encoding
last_hidden_states = outputs[0]  # the last hidden state is the first element of the output tuple
```
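For downstream use you often want a single fixed-size vector per sentence. A common approach (not part of the original card, just an illustrative sketch continuing from the snippet above) is to mean-pool the token vectors of the last hidden state:

```python
# A minimal sketch (assumption, not from the model card): mean-pool the
# token vectors into one sentence embedding. Reuses `model` and `input_ids`
# from the snippet above.
with torch.no_grad():
    outputs = model(input_ids)
    last_hidden_states = outputs[0]               # shape: (1, seq_len, 768)

sentence_embedding = last_hidden_states.mean(dim=1)  # shape: (1, 768)
print(sentence_embedding.shape)                      # torch.Size([1, 768])
```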
Remember to always sanitize your text! Replace the s and t cedilla letters with their comma-below equivalents, like this:

```python
# normalize cedilla ş/ţ to the comma-below ș/ț the model was trained on
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```

because the model was not trained on cedilla s and t letters. If you skip this step, performance will degrade due to `<UNK>` tokens, and the number of tokens per word will increase.
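You can see the effect directly by tokenizing both spellings. This is a quick illustrative check (the example sentence and exact token splits are assumptions; they depend on the vocabulary), reusing the `tokenizer` loaded above:

```python
# Illustrative check: cedilla letters are likely absent from the vocabulary,
# so unsanitized text tends to tokenize into more (or unknown) pieces.
raw = "Aceştia sunt câţiva paşi."  # cedilla ş (U+015F) and ţ (U+0163)
clean = (raw.replace("ţ", "ț").replace("ş", "ș")
            .replace("Ţ", "Ț").replace("Ş", "Ș"))  # comma-below ș/ț

print(tokenizer.tokenize(raw))    # expect more sub-tokens or [UNK] pieces
print(tokenizer.tokenize(clean))  # expect a cleaner segmentation
```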
Evaluation is performed on Universal Dependencies UPOS, XPOS and LAS, as well as on a Named Entity Recognition (NER) task based on RONEC. Detailed results and more in-depth tests are given on the dedicated evaluation page.
The baseline is the Multilingual BERT model bert-base-multilingual-(un)cased, as at the time of writing it was the only available BERT model that worked on Romanian.
| Model | UPOS | XPOS | NER | LAS |
|---|---|---|---|---|
| bert-base-multilingual-uncased | 97.65 | 95.72 | 83.91 | 87.65 |
| bert-base-romanian-uncased-v1 | 98.18 | 96.84 | 85.26 | 89.61 |
The model was trained on the following corpora (the statistics in the table are computed after cleaning):
| Corpus | Lines (M) | Words (M) | Chars (B) | Size (GB) |
|---|---|---|---|---|
| OPUS | 55.05 | 635.04 | 4.045 | 3.8 |
| OSCAR | 33.56 | 1725.82 | 11.411 | 11 |
| Wikipedia | 1.54 | 60.47 | 0.411 | 0.4 |
| Total | 90.15 | 2421.33 | 15.867 | 15.2 |
If you use this model in a research paper, please cite the following paper:
Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324–4328, Online. Association for Computational Linguistics.
or, in BibTeX:
```bibtex
@inproceedings{dumitrescu-etal-2020-birth,
    title = "The birth of {R}omanian {BERT}",
    author = "Dumitrescu, Stefan and Avram, Andrei-Marius and Pyysalo, Sampo",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.findings-emnlp.387",
    doi = "10.18653/v1/2020.findings-emnlp.387",
    pages = "4324--4328",
}
```

Acknowledgements