plT5 Large

The plT5 models are T5-based language models trained on Polish corpora. They were optimized for the original T5 denoising objective.
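To make the denoising objective concrete, here is a minimal sketch of T5-style span corruption: selected spans of the input are replaced by sentinel tokens, and the target lists each sentinel followed by the tokens it hid. The function name `span_corrupt`, the example sentence, and the hand-picked spans are illustrative assumptions; actual pre-training samples spans at random and operates on subword IDs rather than whitespace-split words.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, length) span with a sentinel token.

    `spans` must be sorted and non-overlapping. They are chosen by hand
    here; real pre-training samples them at random (assumption: this is
    only a toy illustration, not the training pipeline).
    Returns (corrupted_input, target) as lists of tokens.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # hide the span in the input
        targets.append(sentinel)             # announce the span in the target
        targets.extend(tokens[start:start + length])
        cursor = start + length
    inputs.extend(tokens[cursor:])           # keep the remaining tail
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets


tokens = "Ala ma kota i psa".split()
corrupted, target = span_corrupt(tokens, [(1, 1), (3, 2)])
print(" ".join(corrupted))  # Ala <extra_id_0> kota <extra_id_1>
print(" ".join(target))     # <extra_id_0> ma <extra_id_1> i psa <extra_id_2>
```

The model reads the corrupted sequence and is trained to generate the target sequence.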

Corpus

The plT5 models were trained on six corpora available for Polish:

Corpus	Tokens	Documents
1232321	3243M	7.9M
1233321	2641M	7.0M
1234321	1357M	3.9M
1235321	1056M	1.1M
1236321	260M	1.4M
1237321	41M	5.5k
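For a rough sense of scale, the rows of the table can be summed as in the sketch below. The corpus identifiers are copied as they appear in the table, and the figures are the rounded values listed there, so the totals are approximate.

```python
# (corpus ID as listed in the table, tokens in millions, documents)
corpora = [
    ("1232321", 3243, 7_900_000),
    ("1233321", 2641, 7_000_000),
    ("1234321", 1357, 3_900_000),
    ("1235321", 1056, 1_100_000),
    ("1236321", 260, 1_400_000),
    ("1237321", 41, 5_500),
]

total_tokens_m = sum(tokens_m for _, tokens_m, _ in corpora)
total_docs = sum(docs for _, _, docs in corpora)

print(f"total tokens: {total_tokens_m}M")  # total tokens: 8598M
print(f"total documents: {total_docs}")    # total documents: 21305500

# Share of the training tokens contributed by each corpus.
for corpus_id, tokens_m, _ in corpora:
    print(f"{corpus_id}: {tokens_m / total_tokens_m:.1%} of tokens")
```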

Tokenizer

The training dataset was tokenized into subwords using a sentencepiece unigram model with a vocabulary size of 50k tokens.
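As an illustration of how a unigram model picks a segmentation, the sketch below runs a Viterbi search for the piece sequence with the highest total log-probability. The vocabulary `toy_vocab` and its scores are invented for this example and have nothing to do with the real 50k-token plT5 vocabulary; the function `viterbi_segment` is likewise a hypothetical helper, not part of the sentencepiece API.

```python
import math


def viterbi_segment(text, log_probs):
    """Return the highest-scoring split of `text` into vocabulary pieces."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]


# Toy vocabulary with made-up log-probabilities (assumption, for illustration).
# "▁" marks a word boundary, following the sentencepiece convention.
toy_vocab = {
    "▁kot": -2.0, "▁ko": -3.0, "ek": -2.5,
    "t": -4.0, "e": -4.5, "k": -4.5, "o": -4.0, "▁": -1.0,
}
pieces = viterbi_segment("▁kotek", toy_vocab)
print(pieces)  # ['▁kot', 'ek']
```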

Usage

Example code:

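from transformers import AutoTokenizer, AutoModel

# Download (or load from the local cache) the tokenizer and the model weights.
tokenizer = AutoTokenizer.from_pretrained("allegro/plt5-large")
# AutoModel loads the bare encoder-decoder, without a language-modeling head.
model = AutoModel.from_pretrained("allegro/plt5-large")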
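If the goal is to generate text rather than extract hidden states, a class with a language-modeling head is needed. The following is a sketch only, assuming that `AutoModelForSeq2SeqLM` is an appropriate class for this checkpoint and that the tokenizer defines the usual T5 sentinel tokens such as `<extra_id_0>`; the helper `fill_spans` and the example sentence are hypothetical. Because the model was pre-trained on denoising only, it would normally be fine-tuned before being used on a downstream task.

```python
MODEL_ID = "allegro/plt5-large"


def fill_spans(text, max_new_tokens=20):
    """Ask the pre-trained model to fill sentinel-marked gaps in `text`.

    Hypothetical helper: assumes `transformers` and `torch` are installed
    and that the checkpoint can be downloaded or is already cached.
    """
    # Imported lazily so that defining the function has no side effects.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep special tokens so the sentinels delimiting each span stay visible.
    return tokenizer.decode(output_ids[0], skip_special_tokens=False)


# Example call (downloads the model weights on first use):
# print(fill_spans("Warszawa jest <extra_id_0> Polski."))
```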

License

CC BY 4.0

Citation

If you use this model, please cite the following paper:

@article{chrabrowa2022evaluation,
  title={Evaluation of Transfer Learning for Polish with a Text-to-Text Model},
  author={Chrabrowa, Aleksandra and Dragan, {\L}ukasz and Grzegorczyk, Karol and Kajtoch, Dariusz and Koszowski, Miko{\l}aj and Mroczkowski, Robert and Rybak, Piotr},
  journal={arXiv preprint arXiv:2205.08808},
  year={2022}
}

Authors

The model was trained by the Machine Learning Research Team at Allegro and the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences.

You can contact us at: klejbenchmark@allegro.pl

Author:

Allegro ML Research

Dataset size:

3.06 GB