Model:
allegro/herbert-large-cased
HerBERT is a BERT-based language model trained on Polish corpora using masked language modeling (MLM) and a sentence structural objective (SSO), with dynamic masking of whole words. For details, please refer to: HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish.
Model training and experiments were conducted with version 2.9 of the transformers library.
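To make the masking objective concrete, here is a minimal, hypothetical sketch of dynamic whole-word masking: subword tokens are grouped back into words, and a fresh set of whole words is masked each time an example is seen. The helper function and sample tokens below are illustrative only, not HerBERT's actual training code:

```python
import random

def whole_word_mask(tokens, mask_token="<mask>", mask_prob=0.15):
    """Toy whole-word masking: every subword of a sampled word is masked together."""
    # CharBPETokenizer marks word endings with "</w>", so a word is a run of
    # tokens up to and including the one that ends with "</w>".
    words, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok.endswith("</w>"):
            words.append(current)
            current = []
    if current:
        words.append(current)

    masked = []
    for word in words:
        if random.random() < mask_prob:
            masked.extend([mask_token] * len(word))  # mask the whole word
        else:
            masked.extend(word)
    return masked

# "Dynamic" means a fresh mask is sampled every time the example is seen,
# rather than being fixed once during preprocessing.
tokens = ["Her", "BERT</w>", "rozumie</w>", "pol", "ski</w>"]  # made-up subwords
for _ in range(2):
    print(whole_word_mask(tokens))
```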
HerBERT was trained on six different Polish corpora:
| Corpus | Tokens | Documents |
|---|---|---|
| CCNet Middle | 3243M | 7.9M |
| CCNet Head | 2641M | 7.0M |
| National Corpus of Polish | 1357M | 3.9M |
| Open Subtitles | 1056M | 1.1M |
| Wikipedia | 260M | 1.4M |
| Wolne Lektury | 41M | 5.5k |
The training dataset was tokenized into subwords using a character-level byte-pair encoder (CharBPETokenizer) with a vocabulary size of 50k tokens. The tokenizer itself was trained with the tokenizers library.
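As a rough sketch of how such a tokenizer can be trained with the tokenizers library (the corpus path and special-token set below are assumptions for illustration, not the exact HerBERT setup):

```python
from tokenizers import CharBPETokenizer

# Character-level BPE, as used for HerBERT's vocabulary.
tokenizer = CharBPETokenizer()
tokenizer.train(
    files=["polish_corpus.txt"],  # hypothetical path to the training text
    vocab_size=50000,             # 50k tokens, as stated above
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],  # assumed set
)
tokenizer.save("herbert-charbpe.json")
```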
We recommend using the fast version of the tokenizer, namely HerbertTokenizerFast.
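For example, the fast tokenizer can be loaded explicitly as follows (the sample sentence is arbitrary):

```python
from transformers import HerbertTokenizerFast

tokenizer = HerbertTokenizerFast.from_pretrained("allegro/herbert-large-cased")
print(tokenizer.tokenize("Kodeks cywilny"))  # subword tokens from the 50k vocabulary
```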
Example code:
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-large-cased")
model = AutoModel.from_pretrained("allegro/herbert-large-cased")

output = model(
    **tokenizer.batch_encode_plus(
        [
            (
                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
            )
        ],
        padding='longest',
        add_special_tokens=True,
        return_tensors='pt'
    )
)
```
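The call returns a standard transformers model output. As a small follow-up sketch, a sentence-pair embedding can be read from its hidden states (shapes assume the model loaded above):

```python
# output.last_hidden_state has shape (batch, sequence_length, hidden_size);
# position 0 holds the special start token, often used as a sequence embedding.
cls_embedding = output.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # e.g. torch.Size([1, 1024]) for herbert-large-cased
```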
License: CC BY 4.0
If you use this model, please cite the following paper:
```bibtex
@inproceedings{mroczkowski-etal-2021-herbert,
    title = "{H}er{BERT}: Efficiently Pretrained Transformer-based Language Model for {P}olish",
    author = "Mroczkowski, Robert and Rybak, Piotr and Wr{\'o}blewska, Alina and Gawlik, Ireneusz",
    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
    month = apr,
    year = "2021",
    address = "Kiyv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.bsnlp-1.1",
    pages = "1--10",
}
```
The model was trained by the Machine Learning Research Team at Allegro and the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences.
You can contact us by email: klejbenchmark@allegro.pl