plT5 Small

plT5 models are T5-based language models trained on Polish corpora. The models were optimized with the original T5 denoising objective.
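The denoising objective mentioned above is T5's span corruption: contiguous spans of the input are replaced by sentinel tokens, and the target lists each sentinel followed by the tokens it hides. The sketch below illustrates the idea only. It assumes whitespace-split words instead of subword tokens and hand-picked spans instead of random sampling, and the helper name `corrupt_spans` is hypothetical; it is not the preprocessing code used to train plT5.

```python
from typing import List, Tuple


def corrupt_spans(
    tokens: List[str], spans: List[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Replace each half-open [start, end) span with a sentinel token.

    Spans must be sorted and non-overlapping. Returns (inputs, targets).
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # mark the gap in the input
        targets.append(sentinel)             # the same marker opens the answer
        targets.extend(tokens[start:end])    # the tokens that were dropped
        cursor = end
    inputs.extend(tokens[cursor:])           # keep the remaining tail
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets


tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = corrupt_spans(tokens, [(2, 4), (8, 9)])
print(" ".join(corrupted))
print(" ".join(target))
```

The model reads the corrupted sequence and is trained to produce the target sequence, so it learns to reconstruct the missing spans from context.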

Corpus

plT5 was trained on six different corpora available for the Polish language:

Corpus	Tokens	Documents
1232321	3243M	7.9M
1233321	2641M	7.0M
1234321	1357M	3.9M
1235321	1056M	1.1M
1236321	260M	1.4M
1237321	41M	5.5k
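To put the table in perspective, the short sketch below sums the token and document counts and prints each corpus's share of the total tokens. The rows are copied from the table as given, including the corpus labels; the helper `parse_count` and the suffix handling are assumptions made for this illustration.

```python
# Rows copied from the table above: (corpus label, tokens, documents).
ROWS = [
    ("1232321", "3243M", "7.9M"),
    ("1233321", "2641M", "7.0M"),
    ("1234321", "1357M", "3.9M"),
    ("1235321", "1056M", "1.1M"),
    ("1236321", "260M", "1.4M"),
    ("1237321", "41M", "5.5k"),
]

SUFFIXES = {"k": 1e3, "M": 1e6}


def parse_count(text: str) -> float:
    """Convert a count such as '3243M' or '5.5k' to a plain number."""
    suffix = text[-1]
    if suffix in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[suffix]
    return float(text)


total_tokens = sum(parse_count(tokens) for _, tokens, _ in ROWS)
total_docs = sum(parse_count(docs) for _, _, docs in ROWS)

print(f"total tokens: {total_tokens / 1e6:.0f}M")
print(f"total documents: {total_docs / 1e6:.1f}M")
for label, tokens, _ in ROWS:
    share = parse_count(tokens) / total_tokens
    print(f"{label}: {share:.1%} of tokens")
```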

Tokenizer

The training dataset was tokenized into subwords using a SentencePiece model with a vocabulary size of 50k tokens.

Usage

Example code:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allegro/plt5-small")
model = AutoModel.from_pretrained("allegro/plt5-small")
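Because `AutoModel` loads the bare encoder-decoder without a language-modeling head, a full forward pass also needs decoder inputs. One way to get something useful from the loaded checkpoint is to run only the encoder and pool its hidden states into sentence vectors. The following is a minimal sketch of that idea, not a usage recommended by the authors: the function name `encode_sentences` and the mean-pooling step are assumptions made for illustration.

```python
def encode_sentences(sentences, model_name="allegro/plt5-small"):
    """Return one mean-pooled encoder vector per input sentence."""
    # Imported here so the heavy dependencies load only when the function runs.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Pad to the longest sentence so the batch forms a rectangular tensor.
    batch = tokenizer(sentences, padding=True, return_tensors="pt")

    with torch.no_grad():
        # Calling only the encoder avoids having to supply decoder inputs.
        hidden = model.get_encoder()(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
        ).last_hidden_state

    # Average over real tokens only, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).type_as(hidden)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

Calling `encode_sentences(["Ala ma kota.", "To jest przykładowe zdanie."])` downloads the checkpoint on first use and returns a tensor with one row per sentence. Since the checkpoint was trained only with the denoising objective, such vectors are a starting point for experiments rather than tuned sentence embeddings.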

License

CC BY 4.0

Citation

If you use this model, please cite the following paper:

@article{chrabrowa2022evaluation,
  title={Evaluation of Transfer Learning for Polish with a Text-to-Text Model},
  author={Chrabrowa, Aleksandra and Dragan, {\L}ukasz and Grzegorczyk, Karol and Kajtoch, Dariusz and Koszowski, Miko{\l}aj and Mroczkowski, Robert and Rybak, Piotr},
  journal={arXiv preprint arXiv:2205.08808},
  year={2022}
}

Authors

The model was trained by the Machine Learning Research Team at Allegro and the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences.

You can contact us at klejbenchmark@allegro.pl.

Author:

Allegro ML Research

Dataset size:

364.74 MB