The BERT base, cased model for Romanian, trained on a 15 GB corpus, version v1.
```python
from transformers import AutoTokenizer, AutoModel
import torch

# load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

# tokenize a sentence and run it through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # batch size 1
outputs = model(input_ids)

# the last hidden state is the first element of the output tuple
last_hidden_states = outputs[0]
```
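If a single fixed-size vector per sentence is needed instead of per-token states, a common convention is to mean-pool the last hidden states. A minimal sketch, assuming the standard Hugging Face API; the pooling strategy is a generic choice, not one prescribed by this model:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

inputs = tokenizer("Acesta este un test.", return_tensors="pt")
with torch.no_grad():
    last_hidden_states = model(**inputs)[0]  # (batch, seq_len, 768)

# mean-pool over the sequence dimension: one 768-d vector per sentence
# (a generic pooling convention, not something this checkpoint mandates)
sentence_embedding = last_hidden_states.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```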
Remember to always sanitize your text! Replace the s and t cedilla letters with the comma-below letters:
```python
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```
because the model was NOT trained on cedilla s and t. If you skip the replacement, performance degrades: you get `<UNK>`s and an increased number of tokens per word.
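The effect is easy to check by tokenizing the same sentence in both spellings. A minimal sketch; the sanitize helper below is an illustrative name, not part of the transformers library:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

def sanitize(text: str) -> str:
    # map cedilla s/t (U+015F, U+0163) to the comma-below letters (U+0219, U+021B)
    return text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")

raw = "Aceştia sunt nişte copaci."  # cedilla spelling
clean = sanitize(raw)              # comma-below spelling the model was trained on

# the unsanitized text should fragment into more (or unknown) tokens
print(len(tokenizer.tokenize(raw)), tokenizer.tokenize(raw))
print(len(tokenizer.tokenize(clean)), tokenizer.tokenize(clean))
```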
Evaluation is performed on Universal Dependencies UPOS, XPOS, and LAS, and on a Named Entity Recognition task based on RONEC. Details, together with more in-depth tests not shown here, are given on the dedicated evaluation page.

The baseline is the Multilingual BERT model bert-base-multilingual-(un)cased, which at the time of writing was the only available BERT model that works on Romanian.
| Model                        | UPOS  | XPOS  | NER   | LAS   |
|------------------------------|-------|-------|-------|-------|
| bert-base-multilingual-cased | 97.87 | 96.16 | 84.13 | 88.04 |
| bert-base-romanian-cased-v1  | 98.00 | 96.46 | 85.88 | 89.69 |
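The checkpoint itself is a bare encoder; tagging and NER results like the ones above come from fine-tuning with a token-classification head on top. A minimal sketch of that setup, assuming the standard AutoModelForTokenClassification API; NUM_LABELS is a placeholder for your corpus's label set, not the actual RONEC configuration (see the evaluation page for that):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

NUM_LABELS = 9  # placeholder: use the label set of your tagging/NER corpus

tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModelForTokenClassification.from_pretrained(
    "dumitrescustefan/bert-base-romanian-cased-v1", num_labels=NUM_LABELS
)

inputs = tokenizer("Acesta este un test.", return_tensors="pt")
logits = model(**inputs).logits  # (batch, seq_len, NUM_LABELS); head is untrained until fine-tuned
```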
The model was trained on the following corpora (the statistics in the table are computed on the cleaned corpora):
| Corpus    | Lines (M) | Words (M) | Chars (B) | Size (GB) |
|-----------|-----------|-----------|-----------|-----------|
| OPUS      | 55.05     | 635.04    | 4.045     | 3.8       |
| OSCAR     | 33.56     | 1725.82   | 11.411    | 11        |
| Wikipedia | 1.54      | 60.47     | 0.411     | 0.4       |
| Total     | 90.15     | 2421.33   | 15.867    | 15.2      |
If you use this model in a research paper, please cite the following paper:
Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324–4328, Online. Association for Computational Linguistics.
or, in BibTeX:
```bibtex
@inproceedings{dumitrescu-etal-2020-birth,
    title = "The birth of {R}omanian {BERT}",
    author = "Dumitrescu, Stefan and Avram, Andrei-Marius and Pyysalo, Sampo",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.findings-emnlp.387",
    doi = "10.18653/v1/2020.findings-emnlp.387",
    pages = "4324--4328",
}
```