This is the BERT base, uncased model for Romanian, trained on a 15 GB corpus, version v1.0.
```python
from transformers import AutoTokenizer, AutoModel
import torch

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1", do_lower_case=True)
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1")

# tokenize a sentence and run it through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # batch size 1
outputs = model(input_ids)

# get the encoding
last_hidden_states = outputs[0]  # the last hidden state is the first element of the output tuple
```
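For downstream use you often want a single fixed-size vector per sentence. A common approach (not part of the original card, just an illustrative sketch continuing from the snippet above) is to mean-pool the token vectors of the last hidden state:

```python
# A minimal sketch (assumption, not from the model card): mean-pool the
# token vectors into one sentence embedding. Reuses `model` and `input_ids`
# from the snippet above.
with torch.no_grad():
    outputs = model(input_ids)
    last_hidden_states = outputs[0]               # shape: (1, seq_len, 768)

sentence_embedding = last_hidden_states.mean(dim=1)  # shape: (1, 768)
print(sentence_embedding.shape)                      # torch.Size([1, 768])
```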
Remember to always sanitize your text! Replace the s and t cedilla letters with their comma-below equivalents, like this:

```python
# normalize cedilla ş/ţ to the comma-below ș/ț the model was trained on
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```

because the model was not trained on cedilla s and t letters. If you skip this step, performance will degrade due to `<UNK>` tokens, and the number of tokens per word will increase.
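You can see the effect directly by tokenizing both spellings. This is a quick illustrative check (the example sentence and exact token splits are assumptions; they depend on the vocabulary), reusing the `tokenizer` loaded above:

```python
# Illustrative check: cedilla letters are likely absent from the vocabulary,
# so unsanitized text tends to tokenize into more (or unknown) pieces.
raw = "Aceştia sunt câţiva paşi."  # cedilla ş (U+015F) and ţ (U+0163)
clean = (raw.replace("ţ", "ț").replace("ş", "ș")
            .replace("Ţ", "Ț").replace("Ş", "Ș"))  # comma-below ș/ț

print(tokenizer.tokenize(raw))    # expect more sub-tokens or [UNK] pieces
print(tokenizer.tokenize(clean))  # expect a cleaner segmentation
```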
Evaluation is performed on Universal Dependencies UPOS, XPOS and LAS, as well as on a Named Entity Recognition (NER) task based on RONEC. Detailed results and more in-depth tests are given on the dedicated evaluation page.
The baseline is the Multilingual BERT model bert-base-multilingual-(un)cased, as at the time of writing it was the only available BERT model that worked on Romanian.
| Model | UPOS | XPOS | NER | LAS |
|---|---|---|---|---|
| bert-base-multilingual-uncased | 97.65 | 95.72 | 83.91 | 87.65 |
| bert-base-romanian-uncased-v1 | 98.18 | 96.84 | 85.26 | 89.61 |
The model was trained on the following corpora (the statistics in the table are computed after cleaning):
| Corpus | Lines (M) | Words (M) | Chars (B) | Size (GB) |
|---|---|---|---|---|
| OPUS | 55.05 | 635.04 | 4.045 | 3.8 |
| OSCAR | 33.56 | 1725.82 | 11.411 | 11 |
| Wikipedia | 1.54 | 60.47 | 0.411 | 0.4 |
| Total | 90.15 | 2421.33 | 15.867 | 15.2 |
If you use this model in a research paper, please cite the following paper:
Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324–4328, Online. Association for Computational Linguistics.
or, in BibTeX:
```bibtex
@inproceedings{dumitrescu-etal-2020-birth,
    title = "The birth of {R}omanian {BERT}",
    author = "Dumitrescu, Stefan and Avram, Andrei-Marius and Pyysalo, Sampo",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.findings-emnlp.387",
    doi = "10.18653/v1/2020.findings-emnlp.387",
    pages = "4324--4328",
}
```

Acknowledgements