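Neural machine translation model for translating from West Germanic languages (gmw) to West Germanic languages (gmw).
This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many of the world's languages. All models were originally trained with the Marian NMT framework, an efficient NMT implementation written in pure C++. The models were then converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and the training pipeline follows the procedures of OPUS-MT-train.

Model description:
This is a multilingual translation model with multiple target languages. Each source sentence must begin with a language token of the form >>id<< (where id is a valid target language ID), for example >>afr<<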
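The sentence-initial token convention can be sketched with a small helper. This is a minimal illustration, not part of transformers or OPUS-MT: the function names `add_target_token` and `tag_batch` are hypothetical, and the helper does not check whether the ID is one the model actually supports.

```python
import re

# Matches an existing sentence-initial token such as ">>nds<<".
# The pattern is an assumption for illustration; it only checks the >>...<< shape.
TOKEN_PATTERN = re.compile(r"^>>[^<>\s]+<<")


def add_target_token(sentence: str, target_lang: str) -> str:
    """Prefix a source sentence with the >>id<< target-language token."""
    sentence = sentence.strip()
    if TOKEN_PATTERN.match(sentence):
        # Already tagged: keep the caller's choice of target language.
        return sentence
    return f">>{target_lang}<< {sentence}"


def tag_batch(sentences, target_lang):
    """Tag every sentence in a batch with the same target language."""
    return [add_target_token(s, target_lang) for s in sentences]


if __name__ == "__main__":
    print(tag_batch(["Red keinen Quatsch."], "nds"))
```

The tagged strings can then be passed to the tokenizer or pipeline in place of hand-written inputs.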
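This model can be used for translation and text-to-text generation.
CONTENT WARNING: Readers should be aware that the model was trained on various public datasets that may contain disturbing or offensive content and that can propagate historical and current stereotypes.
A significant body of research has explored bias and fairness issues in language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).
A short code example: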
```python
from transformers import MarianMTModel, MarianTokenizer

# Each source sentence starts with the >>id<< token of its target language.
src_text = [
    ">>nds<< Red keinen Quatsch.",
    ">>eng<< Findet ihr das nicht etwas übereilt?"
]

model_name = "Helsinki-NLP/opus-mt-tc-big-gmw-gmw"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding, then generate the translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     Kiek ok bi: Rott.
#     Aren't you in a hurry?
```
You can also use OPUS-MT models through the transformers pipeline, for example:
```python
from transformers import pipeline

pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-gmw-gmw")
print(pipe(">>nds<< Red keinen Quatsch."))

# expected output: Kiek ok bi: Rott.
```
langpair | testset | chr-F | BLEU | #sent | #words |
---|---|---|---|---|---|
afr-deu | tatoeba-test-v2021-08-07 | 0.68679 | 50.4 | 1583 | 9105 |
afr-eng | tatoeba-test-v2021-08-07 | 0.70682 | 56.6 | 1374 | 9622 |
afr-nld | tatoeba-test-v2021-08-07 | 0.71516 | 55.5 | 1056 | 6710 |
deu-afr | tatoeba-test-v2021-08-07 | 0.70274 | 54.3 | 1583 | 9507 |
deu-eng | tatoeba-test-v2021-08-07 | 0.66023 | 48.6 | 17565 | 149462 |
deu-nds | tatoeba-test-v2021-08-07 | 0.48058 | 23.2 | 9999 | 76137 |
deu-nld | tatoeba-test-v2021-08-07 | 0.71440 | 54.6 | 10218 | 75235 |
deu-yid | tatoeba-test-v2021-08-07 | 0.09211 | 0.4 | 853 | 5355 |
eng-afr | tatoeba-test-v2021-08-07 | 0.71995 | 56.5 | 1374 | 10317 |
eng-deu | tatoeba-test-v2021-08-07 | 0.63103 | 42.0 | 17565 | 151568 |
eng-nld | tatoeba-test-v2021-08-07 | 0.71062 | 54.5 | 12696 | 91796 |
eng-yid | tatoeba-test-v2021-08-07 | 0.09624 | 0.4 | 2483 | 16395 |
fry-eng | tatoeba-test-v2021-08-07 | 0.40545 | 25.1 | 220 | 1573 |
fry-nld | tatoeba-test-v2021-08-07 | 0.55771 | 41.7 | 260 | 1854 |
gos-deu | tatoeba-test-v2021-08-07 | 0.45302 | 25.4 | 207 | 1168 |
gos-eng | tatoeba-test-v2021-08-07 | 0.37628 | 24.1 | 1154 | 5635 |
gos-nld | tatoeba-test-v2021-08-07 | 0.45777 | 26.2 | 1852 | 9903 |
ltz-deu | tatoeba-test-v2021-08-07 | 0.37165 | 21.3 | 347 | 2208 |
ltz-eng | tatoeba-test-v2021-08-07 | 0.37784 | 30.3 | 293 | 1840 |
ltz-nld | tatoeba-test-v2021-08-07 | 0.32823 | 26.7 | 292 | 1685 |
nds-deu | tatoeba-test-v2021-08-07 | 0.64008 | 45.4 | 9999 | 74564 |
nds-eng | tatoeba-test-v2021-08-07 | 0.55193 | 38.3 | 2500 | 17589 |
nds-nld | tatoeba-test-v2021-08-07 | 0.66943 | 50.0 | 1657 | 11490 |
nld-afr | tatoeba-test-v2021-08-07 | 0.76610 | 62.3 | 1056 | 6823 |
nld-deu | tatoeba-test-v2021-08-07 | 0.73162 | 56.8 | 10218 | 74131 |
nld-eng | tatoeba-test-v2021-08-07 | 0.74088 | 60.5 | 12696 | 89978 |
nld-fry | tatoeba-test-v2021-08-07 | 0.48460 | 31.4 | 260 | 1857 |
nld-nds | tatoeba-test-v2021-08-07 | 0.43779 | 19.9 | 1657 | 11711 |
swg-deu | tatoeba-test-v2021-08-07 | 0.40348 | 16.1 | 1523 | 15632 |
yid-deu | tatoeba-test-v2021-08-07 | 0.06305 | 0.1 | 853 | 5173 |
yid-eng | tatoeba-test-v2021-08-07 | 0.03704 | 0.1 | 2483 | 15452 |
afr-deu | flores101-devtest | 0.58718 | 30.2 | 1012 | 25094 |
afr-eng | flores101-devtest | 0.74826 | 55.1 | 1012 | 24721 |
afr-ltz | flores101-devtest | 0.46826 | 15.7 | 1012 | 25087 |
afr-nld | flores101-devtest | 0.54441 | 22.5 | 1012 | 25467 |
deu-afr | flores101-devtest | 0.57835 | 26.4 | 1012 | 25740 |
deu-eng | flores101-devtest | 0.66990 | 41.8 | 1012 | 24721 |
deu-ltz | flores101-devtest | 0.52554 | 20.3 | 1012 | 25087 |
deu-nld | flores101-devtest | 0.55710 | 24.2 | 1012 | 25467 |
eng-afr | flores101-devtest | 0.68429 | 40.7 | 1012 | 25740 |
eng-deu | flores101-devtest | 0.64888 | 38.5 | 1012 | 25094 |
eng-ltz | flores101-devtest | 0.49231 | 18.4 | 1012 | 25087 |
eng-nld | flores101-devtest | 0.57984 | 26.8 | 1012 | 25467 |
ltz-afr | flores101-devtest | 0.53623 | 23.2 | 1012 | 25740 |
ltz-deu | flores101-devtest | 0.59122 | 30.0 | 1012 | 25094 |
ltz-eng | flores101-devtest | 0.57557 | 31.0 | 1012 | 24721 |
ltz-nld | flores101-devtest | 0.49312 | 18.6 | 1012 | 25467 |
nld-afr | flores101-devtest | 0.52409 | 20.0 | 1012 | 25740 |
nld-deu | flores101-devtest | 0.53898 | 22.6 | 1012 | 25094 |
nld-eng | flores101-devtest | 0.58970 | 30.7 | 1012 | 24721 |
nld-ltz | flores101-devtest | 0.42637 | 11.8 | 1012 | 25087 |
deu-eng | multi30k_test_2016_flickr | 0.60928 | 39.9 | 1000 | 12955 |
eng-deu | multi30k_test_2016_flickr | 0.64172 | 35.4 | 1000 | 12106 |
deu-eng | multi30k_test_2017_flickr | 0.63154 | 40.5 | 1000 | 11374 |
eng-deu | multi30k_test_2017_flickr | 0.63078 | 34.2 | 1000 | 10755 |
deu-eng | multi30k_test_2017_mscoco | 0.55708 | 32.2 | 461 | 5231 |
eng-deu | multi30k_test_2017_mscoco | 0.57537 | 29.1 | 461 | 5158 |
deu-eng | multi30k_test_2018_flickr | 0.59422 | 36.9 | 1071 | 14689 |
eng-deu | multi30k_test_2018_flickr | 0.59597 | 30.0 | 1071 | 13703 |
deu-eng | newssyscomb2009 | 0.54993 | 28.2 | 502 | 11818 |
eng-deu | newssyscomb2009 | 0.53867 | 23.2 | 502 | 11271 |
deu-eng | news-test2008 | 0.54601 | 27.2 | 2051 | 49380 |
eng-deu | news-test2008 | 0.53149 | 23.6 | 2051 | 47447 |
deu-eng | newstest2009 | 0.53747 | 25.9 | 2525 | 65399 |
eng-deu | newstest2009 | 0.53283 | 22.9 | 2525 | 62816 |
deu-eng | newstest2010 | 0.58355 | 30.6 | 2489 | 61711 |
eng-deu | newstest2010 | 0.54885 | 25.8 | 2489 | 61503 |
deu-eng | newstest2011 | 0.54883 | 26.3 | 3003 | 74681 |
eng-deu | newstest2011 | 0.52712 | 23.1 | 3003 | 72981 |
deu-eng | newstest2012 | 0.56153 | 28.5 | 3003 | 72812 |
eng-deu | newstest2012 | 0.52662 | 23.3 | 3003 | 72886 |
deu-eng | newstest2013 | 0.57770 | 31.4 | 3000 | 64505 |
eng-deu | newstest2013 | 0.55774 | 27.8 | 3000 | 63737 |
deu-eng | newstest2014 | 0.59826 | 33.2 | 3003 | 67337 |
eng-deu | newstest2014 | 0.59301 | 29.0 | 3003 | 62688 |
deu-eng | newstest2015 | 0.59660 | 33.4 | 2169 | 46443 |
eng-deu | newstest2015 | 0.59889 | 32.3 | 2169 | 44260 |
deu-eng | newstest2016 | 0.64736 | 39.8 | 2999 | 64119 |
eng-deu | newstest2016 | 0.64427 | 38.3 | 2999 | 62669 |
deu-eng | newstest2017 | 0.60933 | 35.2 | 3004 | 64399 |
eng-deu | newstest2017 | 0.59257 | 30.7 | 3004 | 61287 |
deu-eng | newstest2018 | 0.66797 | 42.6 | 2998 | 67012 |
eng-deu | newstest2018 | 0.69605 | 46.5 | 2998 | 64276 |
deu-eng | newstest2019 | 0.63749 | 39.7 | 2000 | 39227 |
eng-deu | newstest2019 | 0.66751 | 42.9 | 1997 | 48746 |
deu-eng | newstest2020 | 0.61200 | 35.0 | 785 | 38220 |
eng-deu | newstest2020 | 0.60411 | 32.3 | 1418 | 52383 |
deu-eng | newstestB2020 | 0.61255 | 35.1 | 785 | 37696 |
eng-deu | newstestB2020 | 0.59513 | 31.8 | 1418 | 53092 |
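
For readers unfamiliar with the chr-F column, the following is a simplified sketch of a character n-gram F-score on the same 0 to 1 scale as the table. It is an illustration only: the function names are hypothetical, the defaults (n-gram orders 1 to 6, beta = 2, whitespace ignored) are assumptions, and it is not the tool used to produce the scores above, so it should not be expected to reproduce them.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: F-beta over averaged character n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    print(simple_chrf("Kiek ok bi: Rott.", "Kiek ok bi: Rott."))
```

With beta = 2, recall is weighted more heavily than precision, so a hypothesis that omits reference material is penalized more than one that adds extra characters.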
```bibtex
@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}
```
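This work is supported by the European Language Grid as pilot project 2866, funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 771113), and by the FoTran project, funded by the European Union's Horizon 2020 research and innovation programme (grant agreement No 780069). We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland.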