
Portuguese T5 (a.k.a. "PTT5")

Introduction

PTT5 is a T5 model pretrained on BrWac, a large corpus of Brazilian Portuguese web pages, which improves T5's performance on Portuguese sentence-similarity and entailment tasks. It is available in three sizes (small, base, and large) and with two vocabularies: Google's original T5 vocabulary, and ours, trained on Portuguese Wikipedia.

For further information or requests, please visit the PTT5 repository.

Available models

| Model                                                | Size  | #Params | Vocabulary  |
| ---------------------------------------------------- | ----- | ------- | ----------- |
| unicamp-dl/ptt5-small-t5-vocab                       | small | 60M     | Google's T5 |
| unicamp-dl/ptt5-base-t5-vocab                        | base  | 220M    | Google's T5 |
| unicamp-dl/ptt5-large-t5-vocab                       | large | 740M    | Google's T5 |
| unicamp-dl/ptt5-small-portuguese-vocab               | small | 60M     | Portuguese  |
| unicamp-dl/ptt5-base-portuguese-vocab (Recommended)  | base  | 220M    | Portuguese  |
| unicamp-dl/ptt5-large-portuguese-vocab               | large | 740M    | Portuguese  |

Usage

# Tokenizer
from transformers import T5Tokenizer

# PyTorch (bare model, bare model + language modeling head)
from transformers import T5Model, T5ForConditionalGeneration

# TensorFlow (bare model, bare model + language modeling head)
from transformers import TFT5Model, TFT5ForConditionalGeneration

model_name = 'unicamp-dl/ptt5-base-portuguese-vocab'

tokenizer = T5Tokenizer.from_pretrained(model_name)

# PyTorch
model_pt = T5ForConditionalGeneration.from_pretrained(model_name)

# TensorFlow
model_tf = TFT5ForConditionalGeneration.from_pretrained(model_name)

Citation

If you use PTT5, please cite:

@article{ptt5_2020,
  title={PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data},
  author={Carmo, Diedre and Piau, Marcos and Campiotti, Israel and Nogueira, Rodrigo and Lotufo, Roberto},
  journal={arXiv preprint arXiv:2008.09144},
  year={2020}
}