PTT5 is a T5 model pretrained on the BrWac corpus, a large collection of Portuguese web pages, which improves T5's performance on Portuguese sentence-similarity and entailment tasks. It is available in three sizes (small, base and large) and with two vocabularies (Google's original T5 vocabulary, and ours, trained on Portuguese Wikipedia).
For further information or requests, please go to the PTT5 repository.
| Model | Size | #Params | Vocabulary |
| --- | --- | --- | --- |
| unicamp-dl/ptt5-small-t5-vocab | small | 60M | Google's T5 |
| unicamp-dl/ptt5-base-t5-vocab | base | 220M | Google's T5 |
| unicamp-dl/ptt5-large-t5-vocab | large | 740M | Google's T5 |
| unicamp-dl/ptt5-small-portuguese-vocab | small | 60M | Portuguese |
| unicamp-dl/ptt5-base-portuguese-vocab (Recommended) | base | 220M | Portuguese |
| unicamp-dl/ptt5-large-portuguese-vocab | large | 740M | Portuguese |
```python
# Tokenizer
from transformers import T5Tokenizer

# PyTorch (bare model, bare model + language modeling head)
from transformers import T5Model, T5ForConditionalGeneration

# TensorFlow (bare model, bare model + language modeling head)
from transformers import TFT5Model, TFT5ForConditionalGeneration

model_name = 'unicamp-dl/ptt5-base-portuguese-vocab'

tokenizer = T5Tokenizer.from_pretrained(model_name)

# PyTorch
model_pt = T5ForConditionalGeneration.from_pretrained(model_name)

# TensorFlow
model_tf = TFT5ForConditionalGeneration.from_pretrained(model_name)
```
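Once loaded, the pretrained checkpoint can be exercised directly with T5's span-denoising objective. The snippet below is a minimal sketch (the Portuguese sentence and generation settings are illustrative assumptions, not from the original card): it masks a span with a sentinel token and asks the model to fill it in.

```python
# Minimal sketch: fill a masked span with the pretrained PTT5 model.
# The example sentence and generation settings are illustrative assumptions.
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = 'unicamp-dl/ptt5-base-portuguese-vocab'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# "<extra_id_0>" is T5's sentinel token marking the span to reconstruct.
text = "O PTT5 foi pré-treinado em um <extra_id_0> de páginas web em português."
inputs = tokenizer(text, return_tensors='pt')

outputs = model.generate(**inputs, max_length=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```

For downstream tasks such as sentence similarity or entailment, the same checkpoint would typically be fine-tuned rather than used zero-shot as above.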
If you use PTT5, please cite:
```bibtex
@article{ptt5_2020,
  title={PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data},
  author={Carmo, Diedre and Piau, Marcos and Campiotti, Israel and Nogueira, Rodrigo and Lotufo, Roberto},
  journal={arXiv preprint arXiv:2008.09144},
  year={2020}
}
```