opus-mt-tc-big-gmw-gmw

Contents

  • Model Details
  • Uses
  • Risks, Limitations and Biases
  • How to Get Started With the Model
  • Training
  • Evaluation
  • Citation Information
  • Acknowledgements

Model Details

Neural machine translation model for translating from West Germanic languages (gmw) to West Germanic languages (gmw).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages of the world. All models were originally trained with the amazing Marian NMT framework, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by huggingface. Training data is taken from OPUS and training pipelines follow the procedures of OPUS-MT-train. Model description:

  • Developed by: Language Technology Research Group at the University of Helsinki
  • Model type: Translation (transformer-big)
  • Release date: 2022-08-11
  • License: CC-BY-4.0
  • Language(s):
    • Source language(s): afr deu eng enm fry gos gsw hrx ksh ltz nds nld pdc sco stq swg tpi yid
    • Target language(s): afr ang deu eng enm fry gos ltz nds nld sco tpi yid
    • Language pair(s): afr-deu afr-eng afr-nld deu-afr deu-eng deu-ltz deu-nds deu-nld eng-afr eng-deu eng-fry eng-nld fry-eng fry-nld gos-deu gos-eng gos-nld ltz-afr ltz-deu ltz-eng ltz-nld nds-deu nds-eng nds-nld nld-afr nld-deu nld-eng nld-fry
    • Valid target language labels: >>act<< >>afr<< >>afs<< >>aig<< >>ang<< >>ang_Latn<< >>bah<< >>bar<< >>bis<< >>bjs<< >>brc<< >>bzj<< >>bzj_Latn<< >>bzk<< >>cim<< >>dcr<< >>deu<< >>djk<< >>djk_Latn<< >>drt<< >>drt_Latn<< >>dum<< >>eng<< >>enm<< >>enm_Latn<< >>fpe<< >>frk<< >>frr<< >>fry<< >>gcl<< >>gct<< >>geh<< >>gmh<< >>gml<< >>goh<< >>gos<< >>gpe<< >>gsw<< >>gul<< >>gyn<< >>hrx<< >>hrx_Latn<< >>hwc<< >>icr<< >>jam<< >>jvd<< >>kri<< >>ksh<< >>kww<< >>lim<< >>lng<< >>ltz<< >>mhn<< >>nds<< >>nld<< >>odt<< >>ofs<< >>ofs_Latn<< >>oor<< >>osx<< >>pcm<< >>pdc<< >>pdt<< >>pey<< >>pfl<< >>pih<< >>pih_Latn<< >>pis<< >>pis_Latn<< >>qlm<< >>rop<< >>sco<< >>sdz<< >>skw<< >>sli<< >>srm<< >>srm_Latn<< >>srn<< >>stl<< >>stq<< >>svc<< >>swg<< >>sxu<< >>tch<< >>tcs<< >>tgh<< >>tpi<< >>trf<< >>twd<< >>uln<< >>vel<< >>vic<< >>vls<< >>vmf<< >>wae<< >>wep<< >>wes<< >>wes_Latn<< >>wym<< >>ydd<< >>yec<< >>yid<< >>yih<< >>zea<<
  • Original model: opusTCv20210807_transformer-big_2022-08-11.zip
  • Resources for more information:

This is a multilingual translation model with multiple target languages. A sentence-initial language token is required in the form >>id<< (id = a valid target language ID), e.g. >>afr<<. The sketch below shows how the same source sentence can be routed to different target languages simply by changing this token.
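
A minimal sketch of this routing, using the transformers pipeline with the hub model ID from the pipeline example further below (the German sentence is reused from the example code; the particular target IDs are chosen here only for illustration):

from transformers import pipeline

# Load the multilingual model once; one pipeline serves all target languages.
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-gmw-gmw")

src = "Findet ihr das nicht etwas übereilt?"  # German source sentence
for tgt in ["eng", "afr", "nld"]:             # valid target language IDs
    # Prepending >>tgt<< switches the output language of the same model.
    print(tgt, pipe(f">>{tgt}<< {src}")[0]["translation_text"])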

Uses

This model can be used for translation and text generation.

Risks, Limitations and Biases

Content warning: readers should be aware that the model is trained on various public data sets that may contain content that is disturbing or offensive, and that can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

# Sentence-initial tokens (>>nds<<, >>eng<<) select the target language.
src_text = [
    ">>nds<< Red keinen Quatsch.",
    ">>eng<< Findet ihr das nicht etwas übereilt?"
]

model_name = "pytorch-models/opus-mt-tc-big-gmw-gmw"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# expected output:
#     Kiek ok bi: Rott.
#     Aren't you in a hurry?

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-gmw-gmw")
print(pipe(">>nds<< Red keinen Quatsch."))

# expected output: Kiek ok bi: Rott.

Training
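
  • Data: opusTCv20210807 (from OPUS; see the original model file name above)
  • Model type: transformer-big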

Evaluation
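
Benchmark results below report chr-F and BLEU on the Tatoeba, Flores-101, multi30k, and WMT news test sets; #sent and #words give the size of each test set.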

langpair testset chr-F BLEU #sent #words
afr-deu tatoeba-test-v2021-08-07 0.68679 50.4 1583 9105
afr-eng tatoeba-test-v2021-08-07 0.70682 56.6 1374 9622
afr-nld tatoeba-test-v2021-08-07 0.71516 55.5 1056 6710
deu-afr tatoeba-test-v2021-08-07 0.70274 54.3 1583 9507
deu-eng tatoeba-test-v2021-08-07 0.66023 48.6 17565 149462
deu-nds tatoeba-test-v2021-08-07 0.48058 23.2 9999 76137
deu-nld tatoeba-test-v2021-08-07 0.71440 54.6 10218 75235
deu-yid tatoeba-test-v2021-08-07 9.211 0.4 853 5355
eng-afr tatoeba-test-v2021-08-07 0.71995 56.5 1374 10317
eng-deu tatoeba-test-v2021-08-07 0.63103 42.0 17565 151568
eng-nld tatoeba-test-v2021-08-07 0.71062 54.5 12696 91796
eng-yid tatoeba-test-v2021-08-07 9.624 0.4 2483 16395
fry-eng tatoeba-test-v2021-08-07 0.40545 25.1 220 1573
fry-nld tatoeba-test-v2021-08-07 0.55771 41.7 260 1854
gos-deu tatoeba-test-v2021-08-07 0.45302 25.4 207 1168
gos-eng tatoeba-test-v2021-08-07 0.37628 24.1 1154 5635
gos-nld tatoeba-test-v2021-08-07 0.45777 26.2 1852 9903
ltz-deu tatoeba-test-v2021-08-07 0.37165 21.3 347 2208
ltz-eng tatoeba-test-v2021-08-07 0.37784 30.3 293 1840
ltz-nld tatoeba-test-v2021-08-07 0.32823 26.7 292 1685
nds-deu tatoeba-test-v2021-08-07 0.64008 45.4 9999 74564
nds-eng tatoeba-test-v2021-08-07 0.55193 38.3 2500 17589
nds-nld tatoeba-test-v2021-08-07 0.66943 50.0 1657 11490
nld-afr tatoeba-test-v2021-08-07 0.76610 62.3 1056 6823
nld-deu tatoeba-test-v2021-08-07 0.73162 56.8 10218 74131
nld-eng tatoeba-test-v2021-08-07 0.74088 60.5 12696 89978
nld-fry tatoeba-test-v2021-08-07 0.48460 31.4 260 1857
nld-nds tatoeba-test-v2021-08-07 0.43779 19.9 1657 11711
swg-deu tatoeba-test-v2021-08-07 0.40348 16.1 1523 15632
yid-deu tatoeba-test-v2021-08-07 6.305 0.1 853 5173
yid-eng tatoeba-test-v2021-08-07 3.704 0.1 2483 15452
afr-deu flores101-devtest 0.58718 30.2 1012 25094
afr-eng flores101-devtest 0.74826 55.1 1012 24721
afr-ltz flores101-devtest 0.46826 15.7 1012 25087
afr-nld flores101-devtest 0.54441 22.5 1012 25467
deu-afr flores101-devtest 0.57835 26.4 1012 25740
deu-eng flores101-devtest 0.66990 41.8 1012 24721
deu-ltz flores101-devtest 0.52554 20.3 1012 25087
deu-nld flores101-devtest 0.55710 24.2 1012 25467
eng-afr flores101-devtest 0.68429 40.7 1012 25740
eng-deu flores101-devtest 0.64888 38.5 1012 25094
eng-ltz flores101-devtest 0.49231 18.4 1012 25087
eng-nld flores101-devtest 0.57984 26.8 1012 25467
ltz-afr flores101-devtest 0.53623 23.2 1012 25740
ltz-deu flores101-devtest 0.59122 30.0 1012 25094
ltz-eng flores101-devtest 0.57557 31.0 1012 24721
ltz-nld flores101-devtest 0.49312 18.6 1012 25467
nld-afr flores101-devtest 0.52409 20.0 1012 25740
nld-deu flores101-devtest 0.53898 22.6 1012 25094
nld-eng flores101-devtest 0.58970 30.7 1012 24721
nld-ltz flores101-devtest 0.42637 11.8 1012 25087
deu-eng multi30k_test_2016_flickr 0.60928 39.9 1000 12955
eng-deu multi30k_test_2016_flickr 0.64172 35.4 1000 12106
deu-eng multi30k_test_2017_flickr 0.63154 40.5 1000 11374
eng-deu multi30k_test_2017_flickr 0.63078 34.2 1000 10755
deu-eng multi30k_test_2017_mscoco 0.55708 32.2 461 5231
eng-deu multi30k_test_2017_mscoco 0.57537 29.1 461 5158
deu-eng multi30k_test_2018_flickr 0.59422 36.9 1071 14689
eng-deu multi30k_test_2018_flickr 0.59597 30.0 1071 13703
deu-eng newssyscomb2009 0.54993 28.2 502 11818
eng-deu newssyscomb2009 0.53867 23.2 502 11271
deu-eng news-test2008 0.54601 27.2 2051 49380
eng-deu news-test2008 0.53149 23.6 2051 47447
deu-eng newstest2009 0.53747 25.9 2525 65399
eng-deu newstest2009 0.53283 22.9 2525 62816
deu-eng newstest2010 0.58355 30.6 2489 61711
eng-deu newstest2010 0.54885 25.8 2489 61503
deu-eng newstest2011 0.54883 26.3 3003 74681
eng-deu newstest2011 0.52712 23.1 3003 72981
deu-eng newstest2012 0.56153 28.5 3003 72812
eng-deu newstest2012 0.52662 23.3 3003 72886
deu-eng newstest2013 0.57770 31.4 3000 64505
eng-deu newstest2013 0.55774 27.8 3000 63737
deu-eng newstest2014 0.59826 33.2 3003 67337
eng-deu newstest2014 0.59301 29.0 3003 62688
deu-eng newstest2015 0.59660 33.4 2169 46443
eng-deu newstest2015 0.59889 32.3 2169 44260
deu-eng newstest2016 0.64736 39.8 2999 64119
eng-deu newstest2016 0.64427 38.3 2999 62669
deu-eng newstest2017 0.60933 35.2 3004 64399
eng-deu newstest2017 0.59257 30.7 3004 61287
deu-eng newstest2018 0.66797 42.6 2998 67012
eng-deu newstest2018 0.69605 46.5 2998 64276
deu-eng newstest2019 0.63749 39.7 2000 39227
eng-deu newstest2019 0.66751 42.9 1997 48746
deu-eng newstest2020 0.61200 35.0 785 38220
eng-deu newstest2020 0.60411 32.3 1418 52383
deu-eng newstestB2020 0.61255 35.1 785 37696
eng-deu newstestB2020 0.59513 31.8 1418 53092
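
Scores of this kind are commonly computed with sacrebleu; the following is a minimal sketch (the hypothesis and reference sentences are placeholders, not the actual test data), shown here only to make the two metrics concrete:

from sacrebleu.metrics import BLEU, CHRF

# Placeholder system outputs and references, one sentence per entry.
hyps = ["Aren't you in a hurry?"]
refs = [["Aren't you in a hurry?"]]  # a single reference stream

print(BLEU().corpus_score(hyps, refs))  # corpus-level BLEU
print(CHRF().corpus_score(hyps, refs))  # corpus-level chr-F

Note that sacrebleu reports chrF on a 0-100 scale.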

Citation Information

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the European Language Grid as pilot project 2866, by the FoTran project, funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 771113), and by the MeMAD project, funded by the European Union's Horizon 2020 Research and Innovation Programme under grant agreement No 780069. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland.

Model conversion info

  • transformers version: 4.16.2
  • OPUS-MT git hash: 8b9f0b0
  • port time: Fri Aug 12 23:58:31 EEST 2022
  • port machine: LM0-400-22516.local