
T5-EN-VI-SMALL: Pretraining Text-to-Text Transfer Transformer for English-Vietnamese Translation

Dataset

The IWSLT'15 English-Vietnamese data from the Stanford NLP group is used. For all experiments, the corpus was split into training, development, and test sets:

Dataset       Sentences   Download
Training      133,317     via GitHub or located in data/train-en-vi.tgz
Development   1,553       via GitHub or located in data/dev-2012-en-vi.tgz
Test          1,268       via GitHub or located in data/test-2013-en-vi.tgz
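
Assuming the .tgz archives above are standard gzip-compressed tarballs (an assumption based on the extension; the file names inside are not specified here), they can be unpacked with Python's tarfile module, for example:

import tarfile

# Extract each split into data/ and list the files it contains.
for archive in ("data/train-en-vi.tgz",
                "data/dev-2012-en-vi.tgz",
                "data/test-2013-en-vi.tgz"):
    with tarfile.open(archive, "r:gz") as tar:
        print(archive, "->", tar.getnames())
        tar.extractall(path="data/")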

Results

Results on the test set:

Model                                                               BLEU (Beam Search)
Luong & Manning (2015)                                              23.30
Sequence-to-sequence model with attention                           26.10
Neural Phrase-based Machine Translation (Huang et al., 2017)        27.69
Neural Phrase-based Machine Translation + LM (Huang et al., 2017)   28.07
t5-en-vi-small (pretraining, without training data)                 28.46 (cased) / 29.23 (uncased)
t5-en-vi-small (fine-tuning with training data)                     32.38 (cased) / 33.19 (uncased)
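
The cased/uncased pairs above can be computed with a standard BLEU scorer such as sacrebleu; the snippet below is a minimal sketch (the hypothesis and reference strings are hypothetical placeholders, not data from this project), using sacrebleu's lowercase option for the uncased variant.

import sacrebleu

# Placeholder system outputs and gold references (hypothetical examples,
# not taken from the IWSLT'15 test set).
hypotheses = ["Ở trường, chúng tôi dành nhiều thời gian học lịch sử."]
references = ["Ở trường, chúng tôi dành rất nhiều thời gian học lịch sử."]

# corpus_bleu takes a list of hypotheses and a list of reference streams.
cased = sacrebleu.corpus_bleu(hypotheses, [references])
uncased = sacrebleu.corpus_bleu(hypotheses, [references], lowercase=True)

print("BLEU (cased):   %.2f" % cased.score)
print("BLEU (uncased): %.2f" % uncased.score)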
Example Usage
import torch

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Use a GPU if one is available, otherwise fall back to the CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

# Load the pretrained model and tokenizer, and move the model to the device.
model = T5ForConditionalGeneration.from_pretrained("NlpHUST/t5-en-vi-small")
tokenizer = T5Tokenizer.from_pretrained("NlpHUST/t5-en-vi-small")
model.to(device)

src = "In school , we spent a lot of time studying the history of Kim Il-Sung , but we never learned much about the outside world , except that America , South Korea , Japan are the enemies ."
tokenized_text = tokenizer.encode(src, return_tensors="pt").to(device)
model.eval()
summary_ids = model.generate(
                    tokenized_text,
                    max_length=128, 
                    num_beams=5,
                    repetition_penalty=2.5, 
                    length_penalty=1.0, 
                    early_stopping=True
                )
output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(output)
Output
Ở trường, chúng tôi dành nhiều thời gian để nghiên cứu về lịch sử Kim Il-Sung, nhưng chúng tôi chưa bao giờ học được nhiều về thế giới bên ngoài, ngoại trừ Mỹ, Hàn Quốc, Nhật Bản là kẻ thù.
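
To translate several sentences at once, the same model and generation settings can be applied to a padded batch. The sketch below reuses the model, tokenizer, and device from the example above; the input sentences are hypothetical examples.

# Hypothetical input sentences, reusing model, tokenizer, and device
# from the example above.
batch = [
    "How are you today ?",
    "I have never been to Vietnam .",
]

inputs = tokenizer(batch, return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    output_ids = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=128,
        num_beams=5,
        early_stopping=True,
    )

for ids in output_ids:
    print(tokenizer.decode(ids, skip_special_tokens=True))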

Contact information

For personal communication related to this project, please contact Nha Nguyen Van (nha282@gmail.com).