PTT5 is a T5 model pretrained on the BrWac corpus, a large collection of Portuguese web pages, which improves T5's performance on Portuguese sentence similarity and entailment tasks. It is available in three sizes (small, base and large) and with two vocabularies (Google's original T5 vocabulary, and ours, trained on Portuguese Wikipedia).

For further information or requests, please visit the PTT5 repository.
| Model | Size | #Params | Vocabulary |
| --- | --- | --- | --- |
| unicamp-dl/ptt5-small-t5-vocab | small | 60M | Google's T5 |
| unicamp-dl/ptt5-base-t5-vocab | base | 220M | Google's T5 |
| unicamp-dl/ptt5-large-t5-vocab | large | 740M | Google's T5 |
| unicamp-dl/ptt5-small-portuguese-vocab | small | 60M | Portuguese |
| unicamp-dl/ptt5-base-portuguese-vocab (recommended) | base | 220M | Portuguese |
| unicamp-dl/ptt5-large-portuguese-vocab | large | 740M | Portuguese |
```python
# Tokenizer
from transformers import T5Tokenizer

# PyTorch (bare model, bare model + language modeling head)
from transformers import T5Model, T5ForConditionalGeneration

# TensorFlow (bare model, bare model + language modeling head)
from transformers import TFT5Model, TFT5ForConditionalGeneration

model_name = 'unicamp-dl/ptt5-base-portuguese-vocab'
tokenizer = T5Tokenizer.from_pretrained(model_name)

# PyTorch
model_pt = T5ForConditionalGeneration.from_pretrained(model_name)

# TensorFlow
model_tf = TFT5ForConditionalGeneration.from_pretrained(model_name)
```
If you use PTT5, please cite:
```bibtex
@article{ptt5_2020,
  title={PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data},
  author={Carmo, Diedre and Piau, Marcos and Campiotti, Israel and Nogueira, Rodrigo and Lotufo, Roberto},
  journal={arXiv preprint arXiv:2008.09144},
  year={2020}
}
```