The BERT base, cased model for Romanian, trained on a 15 GB corpus, version v1.
   
   
```python
from transformers import AutoTokenizer, AutoModel
import torch

# load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

# tokenize a sentence and run it through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # batch size 1
outputs = model(input_ids)

# get the encoding
last_hidden_states = outputs[0]  # the last hidden state is the first element of the output tuple
```
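`last_hidden_states` has shape `(1, sequence_length, 768)`, i.e. one vector per token. If you need a single fixed-size sentence vector, one common heuristic (not prescribed by this model card) is to mean-pool the token vectors; a minimal sketch continuing the snippet above:

```python
# Sketch only: mean-pool token embeddings into one sentence vector.
# This is a common heuristic, not an official recommendation for this model.
sentence_embedding = last_hidden_states.mean(dim=1)  # shape: (1, 768)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```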
Remember to always sanitize your text! Replace the legacy s and t cedilla-letters with their comma-below equivalents:
```python
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```
because the model was not trained on cedilla-style s and t. If you skip this replacement, performance degrades: the tokenizer produces `<UNK>` tokens and splits each word into more sub-tokens.
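To see the effect directly, here is a small illustrative sketch (the example sentence is arbitrary) comparing the tokenizer's output on the two diacritic forms:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

raw = "Aceştia sunt nişte paşi importanţi."  # legacy cedilla ş/ţ (U+015F, U+0163)
clean = raw.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")

# the cedilla form typically yields [UNK]s or extra sub-tokens;
# the comma-below form matches the training vocabulary
print(tokenizer.tokenize(raw))
print(tokenizer.tokenize(clean))
```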
Evaluation is performed on Universal Dependencies UPOS, XPOS, and LAS, and on a Named Entity Recognition task based on RONEC. Details, as well as more in-depth test results not shown here, are given on the dedicated evaluation page.
The baseline is the Multilingual BERT model bert-base-multilingual-(un)cased, which at the time of writing was the only available BERT model that worked on Romanian.
| Model | UPOS | XPOS | NER | LAS | 
|---|---|---|---|---|
| bert-base-multilingual-cased | 97.87 | 96.16 | 84.13 | 88.04 | 
| bert-base-romanian-cased-v1 | 98.00 | 96.46 | 85.88 | 89.69 | 
The model was trained on the following corpora (the statistics in the table were computed after cleaning):
| Corpus | Lines (M) | Words (M) | Chars (B) | Size (GB) |
|---|---|---|---|---|
| OPUS | 55.05 | 635.04 | 4.045 | 3.8 | 
| OSCAR | 33.56 | 1725.82 | 11.411 | 11 | 
| Wikipedia | 1.54 | 60.47 | 0.411 | 0.4 | 
| Total | 90.15 | 2421.33 | 15.867 | 15.2 | 
If you use this model in a research paper, please cite the following paper:
Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324–4328, Online. Association for Computational Linguistics.
or, in BibTeX:
```bibtex
@inproceedings{dumitrescu-etal-2020-birth,
    title = "The birth of {R}omanian {BERT}",
    author = "Dumitrescu, Stefan  and
      Avram, Andrei-Marius  and
      Pyysalo, Sampo",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.findings-emnlp.387",
    doi = "10.18653/v1/2020.findings-emnlp.387",
    pages = "4324--4328",
}
```
Acknowledgements