模型:
facebook/m2m100-12B-avg-5-ckpt
M2M100是一个针对多对多多语言翻译的编码器-解码器(序列到序列)模型。它在 paper 中进行了介绍,并在 this 首次发布。
该模型可以直接在100种语言的9,900个方向之间进行翻译。为了翻译成目标语言,目标语言ID被强制作为第一个生成的标记。要强制将目标语言ID作为第一个生成的标记,请将 forced_bos_token_id 参数传递给 generate 方法。
注意:M2M100Tokenizer 依赖于 sentencepiece,请确保在运行示例之前安装它。
要安装 sentencepiece,请运行 pip install sentencepiece
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।" chinese_text = "生活就像一盒巧克力。" model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100-12B-avg-5-ckpt") tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100-12B-avg-5-ckpt") # translate Hindi to French tokenizer.src_lang = "hi" encoded_hi = tokenizer(hi_text, return_tensors="pt") generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr")) tokenizer.batch_decode(generated_tokens, skip_special_tokens=True) # => "La vie est comme une boîte de chocolat." # translate Chinese to English tokenizer.src_lang = "zh" encoded_zh = tokenizer(chinese_text, return_tensors="pt") generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en")) tokenizer.batch_decode(generated_tokens, skip_special_tokens=True) # => "Life is like a box of chocolate."
查看 model hub 以获取更多经过优化的版本。
Afrikaans (af), Amharic (am), Arabic (ar), Asturian (ast), Azerbaijani (az), Bashkir (ba), Belarusian (be), Bulgarian (bg), Bengali (bn), Breton (br), Bosnian (bs), Catalan; Valencian (ca), Cebuano (ceb), Czech (cs), Welsh (cy), Danish (da), German (de), Greeek (el), English (en), Spanish (es), Estonian (et), Persian (fa), Fulah (ff), Finnish (fi), French (fr), Western Frisian (fy), Irish (ga), Gaelic; Scottish Gaelic (gd), Galician (gl), Gujarati (gu), Hausa (ha), Hebrew (he), Hindi (hi), Croatian (hr), Haitian; Haitian Creole (ht), Hungarian (hu), Armenian (hy), Indonesian (id), Igbo (ig), Iloko (ilo), Icelandic (is), Italian (it), Japanese (ja), Javanese (jv), Georgian (ka), Kazakh (kk), Central Khmer (km), Kannada (kn), Korean (ko), Luxembourgish; Letzeburgesch (lb), Ganda (lg), Lingala (ln), Lao (lo), Lithuanian (lt), Latvian (lv), Malagasy (mg), Macedonian (mk), Malayalam (ml), Mongolian (mn), Marathi (mr), Malay (ms), Burmese (my), Nepali (ne), Dutch; Flemish (nl), Norwegian (no), Northern Sotho (ns), Occitan (1500年后) (oc), Oriya (or), Panjabi; Punjabi (pa), Polish (pl), Pushto; Pashto (ps), Portuguese (pt), Romanian; Moldavian; Moldovan (ro), Russian (ru), Sindhi (sd), Sinhala; Sinhalese (si), Slovak (sk), Slovenian (sl), Somali (so), Albanian (sq), Serbian (sr), Swati (ss), Sundanese (su), Swedish (sv), Swahili (sw), Tamil (ta), Thai (th), Tagalog (tl), Tswana (tn), Turkish (tr), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi), Wolof (wo), Xhosa (xh), Yiddish (yi), Yoruba (yo), Chinese (zh), Zulu (zu)
@misc{fan2020englishcentric, title={Beyond English-Centric Multilingual Machine Translation}, author={Angela Fan and Shruti Bhosale and Holger Schwenk and Zhiyi Ma and Ahmed El-Kishky and Siddharth Goyal and Mandeep Baines and Onur Celebi and Guillaume Wenzek and Vishrav Chaudhary and Naman Goyal and Tom Birch and Vitaliy Liptchinsky and Sergey Edunov and Edouard Grave and Michael Auli and Armand Joulin}, year={2020}, eprint={2010.11125}, archivePrefix={arXiv}, primaryClass={cs.CL} }