Model:
michaelfeil/ct2fast-m2m100_418M
Speed up inference by 2x to 8x using int8 inference in C++ with a quantized version of facebook/m2m100_418M.
Installation: pip install "hf_hub_ctranslate2>=1.0.3" "ctranslate2>=3.13.0"
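from hf_hub_ctranslate2 import MultiLingualTranslatorCT2fromHfHub
from transformers import AutoTokenizer

# Load the quantized CTranslate2 checkpoint together with the original M2M100 tokenizer.
model = MultiLingualTranslatorCT2fromHfHub(
    model_name_or_path="michaelfeil/ct2fast-m2m100_418M",
    device="cpu",
    compute_type="int8",
    tokenizer=AutoTokenizer.from_pretrained("facebook/m2m100_418M"),
)

# src_lang and tgt_lang are given per input sentence.
outputs = model.generate(
    ["How do you call a fast Flamingo?", "Wie geht es dir?"],
    src_lang=["en", "de"],
    tgt_lang=["de", "fr"],
)
print(outputs)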
For device="cuda", use compute_type="int8_float16".
For device="cpu", use compute_type="int8".
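The device and compute type pairing above can be captured in a small helper. This is a minimal sketch: `pick_compute_type` is a hypothetical name, not part of hf_hub_ctranslate2 or ctranslate2, and it only encodes the two combinations listed here.

```python
def pick_compute_type(device: str) -> str:
    """Return the compute_type recommended above for a given device string."""
    # Mapping taken from the two recommendations in this README.
    recommended = {"cuda": "int8_float16", "cpu": "int8"}
    try:
        return recommended[device]
    except KeyError:
        raise ValueError(f"unsupported device: {device!r}") from None


print(pick_compute_type("cpu"))   # int8
print(pick_compute_type("cuda"))  # int8_float16
```

The returned string could then be passed as the `compute_type` argument when constructing the model.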
Converted to CTranslate2 on 5/13/23 with:
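export ORG="facebook"
export NAME="m2m100_418M"
# --output_dir is required by the converter; "$NAME" is an assumed path,
# chosen to match the directory loaded in the next example.
ct2-transformers-converter --model "$ORG/$NAME" \
    --output_dir "$NAME" \
    --copy_files .gitattributes README.md generation_config.json sentencepiece.bpe.model special_tokens_map.json tokenizer_config.json vocab.json \
    --quantization float16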
Alternatively, use ctranslate2 directly:
import ctranslate2
import transformers

# Load the converted model directory and the original tokenizer.
translator = ctranslate2.Translator("m2m100_418M")
tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/m2m100_418M")
tokenizer.src_lang = "en"

source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world!"))
# Force the target language by prefixing its language token.
target_prefix = [tokenizer.lang_code_to_token["de"]]
results = translator.translate_batch([source], target_prefix=[target_prefix])
# Drop the language token before decoding.
target = results[0].hypotheses[0][1:]

print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
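In the example above, the target language token is passed as `target_prefix`, so it comes back as the first token of the hypothesis and is dropped with `[1:]`. The sketch below illustrates that step on plain token lists, without loading a model. The subword tokens are made up for illustration; `__de__` follows the form M2M100 uses for language tokens, and `strip_target_prefix` is a hypothetical helper.

```python
from typing import List


def strip_target_prefix(hypothesis: List[str], target_prefix: List[str]) -> List[str]:
    """Drop the forced prefix tokens from the start of a hypothesis."""
    n = len(target_prefix)
    if hypothesis[:n] != target_prefix:
        raise ValueError("hypothesis does not start with the target prefix")
    return hypothesis[n:]


target_prefix = ["__de__"]
# Illustrative decoder output: the forced language token, then subword tokens.
hypothesis = ["__de__", "▁Hallo", "▁Welt", "!"]
print(strip_target_prefix(hypothesis, target_prefix))
```

Checking the prefix explicitly, rather than slicing blindly, makes a mismatch between the requested and returned language token visible.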
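M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for many-to-many multilingual translation. It was introduced in the paper "Beyond English-Centric Multilingual Machine Translation" (Fan et al., 2020, arXiv:2010.11125) and first released in the fairseq repository.
The model can translate directly between the 9,900 directions of 100 languages. To translate into a target language, the target language id must be forced as the first generated token; to do so, pass the forced_bos_token_id parameter to the generate method.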
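The figure of 9,900 directions follows from counting ordered (source, target) pairs of distinct languages: 100 × 99. A small check of that arithmetic, using placeholder language codes rather than the real ones:

```python
from itertools import permutations

num_languages = 100
# Placeholder codes "l0" ... "l99"; only the count matters here.
codes = [f"l{i}" for i in range(num_languages)]

# Each direction is an ordered pair of two different languages.
directions = list(permutations(codes, 2))
print(len(directions))  # 9900
```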
Note: M2M100Tokenizer depends on sentencepiece, so be sure to install it before running the example.
To install sentencepiece, run pip install sentencepiece
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
chinese_text = "生活就像一盒巧克力。"

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# translate Hindi to French
tokenizer.src_lang = "hi"
encoded_hi = tokenizer(hi_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "La vie est comme une boîte de chocolat."

# translate Chinese to English
tokenizer.src_lang = "zh"
encoded_zh = tokenizer(chinese_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Life is like a box of chocolate."
See the model hub to look for more fine-tuned versions.
Languages covered:
Afrikaans (af), Amharic (am), Arabic (ar), Asturian (ast), Azerbaijani (az), Bashkir (ba), Belarusian (be), Bulgarian (bg), Bengali (bn), Breton (br), Bosnian (bs), Catalan; Valencian (ca), Cebuano (ceb), Czech (cs), Welsh (cy), Danish (da), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Persian (fa), Fulah (ff), Finnish (fi), French (fr), Western Frisian (fy), Irish (ga), Gaelic; Scottish Gaelic (gd), Galician (gl), Gujarati (gu), Hausa (ha), Hebrew (he), Hindi (hi), Croatian (hr), Haitian; Haitian Creole (ht), Hungarian (hu), Armenian (hy), Indonesian (id), Igbo (ig), Iloko (ilo), Icelandic (is), Italian (it), Japanese (ja), Javanese (jv), Georgian (ka), Kazakh (kk), Central Khmer (km), Kannada (kn), Korean (ko), Luxembourgish; Letzeburgesch (lb), Ganda (lg), Lingala (ln), Lao (lo), Lithuanian (lt), Latvian (lv), Malagasy (mg), Macedonian (mk), Malayalam (ml), Mongolian (mn), Marathi (mr), Malay (ms), Burmese (my), Nepali (ne), Dutch; Flemish (nl), Norwegian (no), Northern Sotho (ns), Occitan (post 1500) (oc), Oriya (or), Panjabi; Punjabi (pa), Polish (pl), Pushto; Pashto (ps), Portuguese (pt), Romanian; Moldavian; Moldovan (ro), Russian (ru), Sindhi (sd), Sinhala; Sinhalese (si), Slovak (sk), Slovenian (sl), Somali (so), Albanian (sq), Serbian (sr), Swati (ss), Sundanese (su), Swedish (sv), Swahili (sw), Tamil (ta), Thai (th), Tagalog (tl), Tswana (tn), Turkish (tr), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi), Wolof (wo), Xhosa (xh), Yiddish (yi), Yoruba (yo), Chinese (zh), Zulu (zu)
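Before calling the model, it can help to validate language codes against the list above. The sketch below parses a short excerpt of that list in its `Name (code)` format. `parse_languages` and `check_direction` are hypothetical helpers, not part of any library mentioned here, and the excerpt would need to be replaced with the full list in practice.

```python
from typing import Dict

# Excerpt of the list above; extend with the full list for real use.
EXCERPT = (
    "German (de), English (en), French (fr), Hindi (hi), "
    "Occitan (post 1500) (oc), Chinese (zh)"
)


def parse_languages(listing: str) -> Dict[str, str]:
    """Map each language code to its name, given 'Name (code), ...' text."""
    languages = {}
    for entry in listing.split(", "):
        # The code is the last parenthesised group of each entry.
        name, _, code = entry.strip().rpartition(" (")
        languages[code.rstrip(")")] = name
    return languages


def check_direction(src: str, tgt: str, languages: Dict[str, str]) -> None:
    """Raise ValueError if either code is missing from the parsed list."""
    for code in (src, tgt):
        if code not in languages:
            raise ValueError(f"unknown language code: {code!r}")


LANGUAGES = parse_languages(EXCERPT)
check_direction("hi", "fr", LANGUAGES)  # passes silently
print(LANGUAGES["oc"])  # Occitan (post 1500)
```

Splitting on the last " (" keeps names that themselves contain parentheses intact.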
@misc{fan2020englishcentric,
  title={Beyond English-Centric Multilingual Machine Translation},
  author={Angela Fan and Shruti Bhosale and Holger Schwenk and Zhiyi Ma and Ahmed El-Kishky and Siddharth Goyal and Mandeep Baines and Onur Celebi and Guillaume Wenzek and Vishrav Chaudhary and Naman Goyal and Tom Birch and Vitaliy Liptchinsky and Sergey Edunov and Edouard Grave and Michael Auli and Armand Joulin},
  year={2020},
  eprint={2010.11125},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}