模型:

michaelfeil/ct2fast-m2m100_418M

英文

Ctranslate2实现快速推理

使用C++量化版本的facebook/m2m100_1.2B进行int8推理,可将推理速度提高2倍至8倍

安装命令:pip install hf_hub_ctranslate2>=1.0.3 ctranslate2>=3.13.0

from hf_hub_ctranslate2 import MultiLingualTranslatorCT2fromHfHub

model = MultiLingualTranslatorCT2fromHfHub(
    model_name_or_path="michaelfeil/ct2fast-m2m100_418M", device="cpu", compute_type="int8",
    tokenizer=AutoTokenizer.from_pretrained(f"facebook/m2m100_418M")
)

outputs = model.generate(
    ["How do you call a fast Flamingo?", "Wie geht es dir?"],
    src_lang=["en", "de"],
    tgt_lang=["de", "fr"]
)

对于设备为"cuda",使用compute_type=int8_float16

对于设备为"cpu",使用compute_type=int8

已转换5/13/23至Ctranslate2

export ORG="facebook"
export NAME="m2m100_418M"
ct2-transformers-converter --model "$ORG/$NAME" --copy_files .gitattributes README.md generation_config.json sentencepiece.bpe.model  special_tokens_map.json tokenizer_config.json vocab.json --quantization float16

另一种

import ctranslate2
import transformers

translator = ctranslate2.Translator("m2m100_418M")
tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/m2m100_418M")
tokenizer.src_lang = "en"

source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world!"))
target_prefix = [tokenizer.lang_code_to_token["de"]]
results = translator.translate_batch([source], target_prefix=[target_prefix])
target = results[0].hypotheses[0][1:]

print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))

原始:M2M100 418M

M2M100是一个多语言编码器-解码器(seq-to-seq)模型,用于进行多对多多语言翻译。该模型在这个 paper 中进行了介绍,并首次在这个 this 存储库中发布。

该模型可直接在100种语言之间进行9,900个方向的翻译。为了将文本翻译为目标语言,需要将目标语言的id强制作为生成的第一个标记。可以通过将 forced_bos_token_id 参数传递给 generate 方法来实现将目标语言的id强制作为生成的第一个标记。

注意: M2M100Tokenizer 依赖 sentencepiece ,请确保在运行示例之前安装它。

要安装 sentencepiece,请运行 pip install sentencepiece

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
chinese_text = "生活就像一盒巧克力。"

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# translate Hindi to French
tokenizer.src_lang = "hi"
encoded_hi = tokenizer(hi_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "La vie est comme une boîte de chocolat."

# translate Chinese to English
tokenizer.src_lang = "zh"
encoded_zh = tokenizer(chinese_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Life is like a box of chocolate."

查看 model hub 查找更多经过优化的版本。

支持的语言

Afrikaans (af), Amharic (am), Arabic (ar), Asturian (ast), Azerbaijani (az), Bashkir (ba), Belarusian (be), Bulgarian (bg), Bengali (bn), Breton (br), Bosnian (bs), Catalan; Valencian (ca), Cebuano (ceb), Czech (cs), Welsh (cy), Danish (da), German (de), Greeek (el), English (en), Spanish (es), Estonian (et), Persian (fa), Fulah (ff), Finnish (fi), French (fr), Western Frisian (fy), Irish (ga), Gaelic; Scottish Gaelic (gd), Galician (gl), Gujarati (gu), Hausa (ha), Hebrew (he), Hindi (hi), Croatian (hr), Haitian; Haitian Creole (ht), Hungarian (hu), Armenian (hy), Indonesian (id), Igbo (ig), Iloko (ilo), Icelandic (is), Italian (it), Japanese (ja), Javanese (jv), Georgian (ka), Kazakh (kk), Central Khmer (km), Kannada (kn), Korean (ko), Luxembourgish; Letzeburgesch (lb), Ganda (lg), Lingala (ln), Lao (lo), Lithuanian (lt), Latvian (lv), Malagasy (mg), Macedonian (mk), Malayalam (ml), Mongolian (mn), Marathi (mr), Malay (ms), Burmese (my), Nepali (ne), Dutch; Flemish (nl), Norwegian (no), Northern Sotho (ns), Occitan (post 1500) (oc), Oriya (or), Panjabi; Punjabi (pa), Polish (pl), Pushto; Pashto (ps), Portuguese (pt), Romanian; Moldavian; Moldovan (ro), Russian (ru), Sindhi (sd), Sinhala; Sinhalese (si), Slovak (sk), Slovenian (sl), Somali (so), Albanian (sq), Serbian (sr), Swati (ss), Sundanese (su), Swedish (sv), Swahili (sw), Tamil (ta), Thai (th), Tagalog (tl), Tswana (tn), Turkish (tr), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi), Wolof (wo), Xhosa (xh), Yiddish (yi), Yoruba (yo), Chinese (zh), Zulu (zu)

BibTeX条目和引用信息

@misc{fan2020englishcentric,
      title={Beyond English-Centric Multilingual Machine Translation}, 
      author={Angela Fan and Shruti Bhosale and Holger Schwenk and Zhiyi Ma and Ahmed El-Kishky and Siddharth Goyal and Mandeep Baines and Onur Celebi and Guillaume Wenzek and Vishrav Chaudhary and Naman Goyal and Tom Birch and Vitaliy Liptchinsky and Sergey Edunov and Edouard Grave and Michael Auli and Armand Joulin},
      year={2020},
      eprint={2010.11125},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}