PTT5 is a T5 model pretrained on BrWac, a large corpus of Brazilian Portuguese web pages, which improves T5's performance on Portuguese sentence similarity and entailment tasks. It is available in three sizes (small, base, and large) and with two vocabularies (Google's original T5 vocabulary, and ours, trained on Portuguese Wikipedia).

For further information or requests, please visit the PTT5 repository.
Model | Size | #Params | Vocabulary |
---|---|---|---|
unicamp-dl/ptt5-small-t5-vocab | small | 60M | Google's T5 |
unicamp-dl/ptt5-base-t5-vocab | base | 220M | Google's T5 |
unicamp-dl/ptt5-large-t5-vocab | large | 740M | Google's T5 |
unicamp-dl/ptt5-small-portuguese-vocab | small | 60M | Portuguese |
unicamp-dl/ptt5-base-portuguese-vocab (Recommended) | base | 220M | Portuguese |
unicamp-dl/ptt5-large-portuguese-vocab | large | 740M | Portuguese |
```python
# Tokenizer
from transformers import T5Tokenizer

# PyTorch (bare model, and bare model + language modeling head)
from transformers import T5Model, T5ForConditionalGeneration

# TensorFlow (bare model, and bare model + language modeling head)
from transformers import TFT5Model, TFT5ForConditionalGeneration

model_name = 'unicamp-dl/ptt5-base-portuguese-vocab'
tokenizer = T5Tokenizer.from_pretrained(model_name)

# PyTorch
model_pt = T5ForConditionalGeneration.from_pretrained(model_name)

# TensorFlow
model_tf = TFT5ForConditionalGeneration.from_pretrained(model_name)
```
If you use PTT5, please cite:
```bibtex
@article{ptt5_2020,
  title={PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data},
  author={Carmo, Diedre and Piau, Marcos and Campiotti, Israel and Nogueira, Rodrigo and Lotufo, Roberto},
  journal={arXiv preprint arXiv:2008.09144},
  year={2020}
}
```