
Portuguese T5 (aka "PTT5")

Introduction

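PTT5 is a T5 model pretrained on the BrWaC corpus, a large collection of web pages in Portuguese. This pretraining improves T5's performance on Portuguese sentence similarity and entailment tasks. The model is available in three sizes (small, base, and large) and with two vocabularies (Google's original T5 vocabulary and our own, trained on Portuguese Wikipedia).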

For further information or requests, please go to the PTT5 repository.

Available models

Model                  Size   #Params  Vocabulary
1233321                small  60M      Google's T5
1234321                base   220M     Google's T5
1235321                large  740M     Google's T5
1236321                small  60M      Portuguese
1237321 (Recommended)  base   220M     Portuguese
1238321                large  740M     Portuguese

Usage

# Tokenizer 
from transformers import T5Tokenizer

# PyTorch (bare model, bare model + language modeling head)
from transformers import T5Model, T5ForConditionalGeneration

# TensorFlow (bare model, bare model + language modeling head)
from transformers import TFT5Model, TFT5ForConditionalGeneration

model_name = 'unicamp-dl/ptt5-base-portuguese-vocab'

tokenizer = T5Tokenizer.from_pretrained(model_name)

# PyTorch
model_pt = T5ForConditionalGeneration.from_pretrained(model_name)

# TensorFlow
model_tf = TFT5ForConditionalGeneration.from_pretrained(model_name)
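
Once the model is loaded, a common next step is to score or fine-tune it on source/target text pairs. The following is a minimal PyTorch sketch, not part of the original instructions: the helper name `seq2seq_loss` and the example sentences are hypothetical, and it assumes `torch`, `transformers`, and `sentencepiece` are installed. Since the checkpoints are described above as pretrained, the loss it returns is only illustrative until the model has been fine-tuned on a downstream task.

```python
MODEL_NAME = "unicamp-dl/ptt5-base-portuguese-vocab"  # same checkpoint as in the usage snippet


def seq2seq_loss(source_text, target_text, model_name=MODEL_NAME):
    """Return the cross-entropy loss of target_text given source_text.

    Hypothetical helper for illustration; downloads the checkpoint on first call.
    """
    # Imported lazily so that defining the function does not require the libraries.
    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    model.eval()  # disable dropout so the loss is deterministic

    # Encode the input and the expected output as token ids.
    inputs = tokenizer(source_text, return_tensors="pt")
    labels = tokenizer(target_text, return_tensors="pt").input_ids

    # Passing `labels` makes the model compute the language modeling loss.
    with torch.no_grad():
        outputs = model(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            labels=labels,
        )
    return outputs.loss.item()


# Example call (hypothetical sentence pair; requires network access):
# print(seq2seq_loss("O céu é azul.", "O céu é azul."))
```

For fine-tuning, the same forward pass would be run without `torch.no_grad()` and with the model in training mode, calling `outputs.loss.backward()` inside an optimizer loop.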

Citation

If you use PTT5, please cite:

@article{ptt5_2020,
  title={PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data},
  author={Carmo, Diedre and Piau, Marcos and Campiotti, Israel and Nogueira, Rodrigo and Lotufo, Roberto},
  journal={arXiv preprint arXiv:2008.09144},
  year={2020}
}