
T5 model for splitting complex sentences into simple sentences (English)

Split-and-rephrase is the task of splitting a complex input sentence into shorter sentences while preserving its meaning (Narayan et al., 2017).

For example:

Cystic Fibrosis (CF) is an autosomal recessive disorder that affects multiple organs,
which is common in the Caucasian population, symptomatically affecting 1 in 2500 newborns in the UK,
and more than 80,000 individuals globally.

can be split into

Cystic Fibrosis is an autosomal recessive disorder that affects multiple organs. 
Cystic Fibrosis is common in the Caucasian population.
Cystic Fibrosis affects 1 in 2500 newborns in the UK. 
Cystic Fibrosis affects more than 80,000 individuals globally.

How to use it in code:

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the tokenizer and the model from the checkpoint.
checkpoint = "unikei/t5-base-split-and-rephrase"
tokenizer = T5Tokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

complex_sentence = "Cystic Fibrosis (CF) is an autosomal recessive disorder that \
affects multiple organs, which is common in the Caucasian \
population, symptomatically affecting 1 in 2500 newborns in \
the UK, and more than 80,000 individuals globally."
# Tokenize the input, padding or truncating it to 256 tokens.
complex_tokenized = tokenizer(complex_sentence,
                              padding="max_length",
                              truncation=True,
                              max_length=256,
                              return_tensors='pt')

# Generate the simple sentences with beam search (5 beams).
simple_tokenized = model.generate(complex_tokenized['input_ids'],
                                  attention_mask=complex_tokenized['attention_mask'],
                                  max_length=256,
                                  num_beams=5)

# Decode the generated token ids back into text.
simple_sentences = tokenizer.batch_decode(simple_tokenized, skip_special_tokens=True)
print(simple_sentences)

"""
Output:
Cystic Fibrosis is an autosomal recessive disorder that affects multiple organs. Cystic Fibrosis affects 1 in 2500 newborns in the UK. Cystic Fibrosis affects more than 80,000 individuals globally. Cystic Fibrosis is common in the Caucasian population.
"""
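
The model returns all simple sentences joined in a single string, as the output above shows. If a list with one sentence per item is needed, a small post-processing step may help. The following is only a minimal sketch and not part of the model or of transformers: the helper name split_into_sentences is hypothetical, and it assumes every sentence ends with '.', '!' or '?' and that the next sentence starts with an uppercase letter.

```python
import re
from typing import List


def split_into_sentences(text: str) -> List[str]:
    """Split a generated string into individual sentences.

    Assumption (not guaranteed by the model): a sentence boundary is
    '.', '!' or '?' followed by whitespace and an uppercase letter.
    """
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [part for part in parts if part]


# The generated string shown in the output above.
generated = (
    "Cystic Fibrosis is an autosomal recessive disorder that affects multiple organs. "
    "Cystic Fibrosis affects 1 in 2500 newborns in the UK. "
    "Cystic Fibrosis affects more than 80,000 individuals globally. "
    "Cystic Fibrosis is common in the Caucasian population."
)

sentences = split_into_sentences(generated)
for sentence in sentences:
    print(sentence)
```

This heuristic will mis-split text that contains abbreviations with periods followed by a capitalized word (for example "Dr. Smith"); a dedicated sentence tokenizer is likely a better choice for such inputs.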