Multilingual Paraphrase Generation

This repository contains the IndicBART checkpoint fine-tuned on the 11 languages of the IndicParaphrase dataset. For fine-tuning details, see the paper.

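Supported languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Odia, Punjabi, Kannada, Malayalam, Tamil, and Telugu. Not all of these languages are supported by mBART50 and mT5.
The model is much smaller than the mBART and mT5(-base) models, so decoding is less computationally expensive.
The model was trained on a large Indic-language corpus (5.53 million sentences).
To encourage transfer learning among related languages, all languages are represented in the Devanagari script.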

Using this model in transformers

from transformers import MBartForConditionalGeneration, AutoModelForSeq2SeqLM
from transformers import AlbertTokenizer, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/MultiIndicParaphraseGeneration", do_lower_case=False, use_fast=False, keep_accents=True)
# Or use tokenizer = AlbertTokenizer.from_pretrained("ai4bharat/MultiIndicParaphraseGeneration", do_lower_case=False, use_fast=False, keep_accents=True)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/MultiIndicParaphraseGeneration")
# Or use model = MBartForConditionalGeneration.from_pretrained("ai4bharat/MultiIndicParaphraseGeneration")

# Some initial mapping
bos_id = tokenizer._convert_token_to_id_with_added_voc("<s>")
eos_id = tokenizer._convert_token_to_id_with_added_voc("</s>")
pad_id = tokenizer._convert_token_to_id_with_added_voc("<pad>")

# To get lang_id use any of ['<2as>', '<2bn>', '<2en>', '<2gu>', '<2hi>', '<2kn>', '<2ml>', '<2mr>', '<2or>', '<2pa>', '<2ta>', '<2te>']
# First tokenize the input. The format below is how IndicBART was trained so the input should be "Sentence </s> <2xx>" where xx is the language code. Similarly, the output should be "<2yy> Sentence </s>".
inp = tokenizer("दिल्ली यूनिवर्सिटी देश की प्रसिद्ध यूनिवर्सिटी में से एक है. </s> <2hi>", add_special_tokens=False, return_tensors="pt", padding=True).input_ids 

# Generate with beam search. Note that decoder_start_token_id is set to the target-language tag.
model_output = model.generate(
    inp,
    use_cache=True,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,
    num_beams=4,
    max_length=20,
    min_length=1,
    early_stopping=True,
    pad_token_id=pad_id,
    bos_token_id=bos_id,
    eos_token_id=eos_id,
    decoder_start_token_id=tokenizer._convert_token_to_id_with_added_voc("<2hi>"),
)

# Decode to get output strings
decoded_output=tokenizer.decode(model_output[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(decoded_output) # दिल्ली विश्वविद्यालय देश की प्रमुख विश्वविद्यालयों में शामिल है।

# Note: if your output language is not Hindi or Marathi, convert the output from Devanagari to that language's script using the Indic NLP Library.
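
The input and output conventions described in the comments above can be wrapped in small helpers. The following is a minimal sketch and not part of the original model card: the names `LANG_CODES`, `lang_tag`, `format_input`, and `format_target` are hypothetical, and the code only builds strings, so it runs without downloading the model.

```python
# Hypothetical helpers (not from the model card) that build strings in the
# "Sentence </s> <2xx>" and "<2yy> Sentence </s>" formats described above.
LANG_CODES = ["as", "bn", "en", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te"]


def lang_tag(lang):
    """Return the language tag token, e.g. "<2hi>", for a known language code."""
    if lang not in LANG_CODES:
        raise ValueError(f"unsupported language code: {lang!r}")
    return f"<2{lang}>"


def format_input(sentence, lang):
    """Encoder-side format: sentence, end-of-sentence token, then the language tag."""
    return f"{sentence.strip()} </s> {lang_tag(lang)}"


def format_target(sentence, lang):
    """Decoder-side format: language tag, sentence, then the end-of-sentence token."""
    return f"{lang_tag(lang)} {sentence.strip()} </s>"


print(format_input("दिल्ली यूनिवर्सिटी देश की प्रसिद्ध यूनिवर्सिटी में से एक है.", "hi"))
```

The string returned by `format_input` would then be passed to the tokenizer with `add_special_tokens=False`, as in the example above, since the special tokens are already part of the text.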

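Note:

If you want to use any language written in a non-Devanagari script, first convert it to Devanagari using the Indic NLP Library. After you obtain the output, convert it back to the original script.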
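
The note above points to the Indic NLP Library for this conversion. To illustrate the idea without that dependency, here is a simplified sketch. It relies on the fact that the major Indic scripts occupy parallel 128-code-point blocks in Unicode, so most characters can be mapped by shifting the code point. The function name `convert_script`, the block table, and the cutoff offset are assumptions made for this example. The sketch ignores script-specific exceptions (Tamil, for instance, lacks many of the consonants found in Devanagari), so prefer the library for real use.

```python
# Simplified sketch of script conversion by Unicode block offset.
# Assumption: this is an illustration only, not the Indic NLP Library's code.

# First code point of each script's Unicode block, keyed by language code.
SCRIPT_BLOCK_START = {
    "hi": 0x0900, "mr": 0x0900,  # Devanagari
    "bn": 0x0980, "as": 0x0980,  # Bengali-Assamese
    "pa": 0x0A00,                # Gurmukhi
    "gu": 0x0A80,                # Gujarati
    "or": 0x0B00,                # Odia
    "ta": 0x0B80,                # Tamil
    "te": 0x0C00,                # Telugu
    "kn": 0x0C80,                # Kannada
    "ml": 0x0D00,                # Malayalam
}

# Assumption: only offsets up to 0x6F (letters, vowel signs, digits) are
# treated as aligned across scripts; higher offsets are left untouched.
MAX_ALIGNED_OFFSET = 0x6F

# The danda and double danda live in the Devanagari block but are shared
# punctuation, so they are never shifted.
DANDAS = {"\u0964", "\u0965"}


def convert_script(text, src_lang, tgt_lang):
    """Shift characters from src_lang's Unicode block into tgt_lang's block."""
    src_start = SCRIPT_BLOCK_START[src_lang]
    tgt_start = SCRIPT_BLOCK_START[tgt_lang]
    converted = []
    for ch in text:
        offset = ord(ch) - src_start
        if 0 <= offset <= MAX_ALIGNED_OFFSET and ch not in DANDAS:
            converted.append(chr(tgt_start + offset))
        else:
            converted.append(ch)  # spaces, Latin text, punctuation, etc.
    return "".join(converted)


# Bengali input -> Devanagari for the model, then back after generation.
bengali = "বাংলা"
devanagari = convert_script(bengali, "bn", "hi")
restored = convert_script(devanagari, "hi", "bn")
print(devanagari, restored == bengali)
```

In this workflow the Devanagari string is what gets fed to the model (with the appropriate language tag), and the model output is passed through the reverse conversion.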

Benchmarks

Scores on the IndicParaphrase test sets are as follows:

Language	BLEU / Self-BLEU / iBLEU
as	1.66 / 2.06 / 0.54
bn	11.57 / 1.69 / 7.59
gu	22.10 / 2.76 / 14.64
hi	27.29 / 2.87 / 18.24
kn	15.40 / 2.98 / 9.89
ml	10.57 / 1.70 / 6.89
mr	20.38 / 2.20 / 13.61
or	19.26 / 2.10 / 12.85
pa	14.87 / 1.35 / 10.00
ta	18.52 / 2.88 / 12.10
te	16.70 / 3.34 / 10.69
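
The iBLEU column appears to combine the other two columns as iBLEU = alpha * BLEU - (1 - alpha) * Self-BLEU, which rewards overlap with the reference and penalizes overlap with the input. The source does not state alpha. The value 0.7 used below is an assumption inferred from the table, chosen because it reproduces every row to within rounding. A short sketch that recomputes the column:

```python
ALPHA = 0.7  # assumption: inferred from the table, not stated in the source


def ibleu(bleu, self_bleu, alpha=ALPHA):
    """Combine BLEU and Self-BLEU into a single score."""
    return alpha * bleu - (1 - alpha) * self_bleu


# (BLEU, Self-BLEU, reported iBLEU) copied from the table above.
SCORES = {
    "as": (1.66, 2.06, 0.54),
    "bn": (11.57, 1.69, 7.59),
    "gu": (22.10, 2.76, 14.64),
    "hi": (27.29, 2.87, 18.24),
    "kn": (15.40, 2.98, 9.89),
    "ml": (10.57, 1.70, 6.89),
    "mr": (20.38, 2.20, 13.61),
    "or": (19.26, 2.10, 12.85),
    "pa": (14.87, 1.35, 10.00),
    "ta": (18.52, 2.88, 12.10),
    "te": (16.70, 3.34, 10.69),
}

for lang, (bleu, self_bleu, reported) in SCORES.items():
    computed = ibleu(bleu, self_bleu)
    print(f"{lang}: computed {computed:.2f}, reported {reported:.2f}")
```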

Citation

If you use this model, please cite the following paper:

@inproceedings{Kumar2022IndicNLGSM,
  title={IndicNLG Suite: Multilingual Datasets for Diverse NLG Tasks in Indic Languages},
  author={Aman Kumar and Himani Shrotriya and Prachi Sahu and Raj Dabre and Ratish Puduppully and Anoop Kunchukuttan and Amogh Mishra and Mitesh M. Khapra and Pratyush Kumar},
  year={2022},
  url={https://arxiv.org/abs/2203.05437}
}

Author:

AI4Bharat

Dataset size:

933.01 MB