Dataset: menyo20k_mt
Tasks: translation
Languages: en, yo
Task IDs: translation
Size: 10K<n<100K
Language Creators: found
Source Datasets: original
ArXiv: arxiv:2103.08647

Dataset Card for MENYO-20k

Dataset Summary

MENYO-20k is a multi-domain parallel dataset with texts obtained from news articles, TED talks, movie transcripts, radio transcripts, science and technology texts, and other short articles curated from the web and professional translators. The dataset has 20,100 parallel sentences split into 10,070 training sentences, 3,397 development sentences, and 6,633 test sentences (3,419 multi-domain, 1,714 news domain, and 1,500 TED talk transcript domain).

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The languages are English and Yorùbá.

Dataset Structure

Data Instances

An instance example:

{'translation':
  {'en': 'Unit 1: What is Creative Commons?',
  'yo': 'Ìdá 1: Kín ni Creative Commons?'
  }
}
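
A minimal loading sketch using the Hugging Face `datasets` library, assuming the dataset is available on the Hub under the `menyo20k_mt` identifier listed in this card's metadata:

from datasets import load_dataset

# "menyo20k_mt" is the identifier from this card's metadata; adjust
# if the dataset lives under a namespaced path on the Hub.
dataset = load_dataset("menyo20k_mt")

# Each example carries a "translation" dict with the English ("en")
# and Yoruba ("yo") sides of one sentence pair, as shown above.
example = dataset["train"][0]
print(example["translation"]["en"])
print(example["translation"]["yo"])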

Data Fields

  • translation:
    • en: English sentence.
    • yo: Yorùbá sentence.

Data Splits

Training, validation, and test splits are provided.
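
A quick sanity check of the split sizes — a sketch only, since the exact split names (train/validation/test is assumed here) may differ in the Hub configuration:

from datasets import load_dataset

dataset = load_dataset("menyo20k_mt")

# Print the number of sentence pairs per split; the counts should
# match the figures quoted in the Dataset Summary above.
for split_name, split in dataset.items():
    print(split_name, len(split))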

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

The dataset is open, but for non-commercial use only, because some of the data sources, such as TED talks and JW News, require permission for commercial use.

The dataset is licensed under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license: https://github.com/uds-lsv/menyo-20k_MT/blob/master/LICENSE

Citation Information

If you use this dataset, please cite the following paper:

@inproceedings{adelani-etal-2021-effect,
    title = "The Effect of Domain and Diacritics in {Y}oruba{--}{E}nglish Neural Machine Translation",
    author = "Adelani, David  and
      Ruiter, Dana  and
      Alabi, Jesujoba  and
      Adebonojo, Damilola  and
      Ayeni, Adesina  and
      Adeyemi, Mofe  and
      Awokoya, Ayodele Esther  and
      Espa{\~n}a-Bonet, Cristina",
    booktitle = "Proceedings of the 18th Biennial Machine Translation Summit (Volume 1: Research Track)",
    month = aug,
    year = "2021",
    address = "Virtual",
    publisher = "Association for Machine Translation in the Americas",
    url = "https://aclanthology.org/2021.mtsummit-research.6",
    pages = "61--75",
    abstract = "Massively multilingual machine translation (MT) has shown impressive capabilities, including zero and few-shot translation between low-resource language pairs. However, these models are often evaluated on high-resource languages with the assumption that they generalize to low-resource ones. The difficulty of evaluating MT models on low-resource pairs is often due to lack of standardized evaluation datasets. In this paper, we present MENYO-20k, the first multi-domain parallel corpus with an especially curated orthography for Yoruba{--}English with standardized train-test splits for benchmarking. We provide several neural MT benchmarks and compare them to the performance of popular pre-trained (massively multilingual) MT models both for the heterogeneous test set and its subdomains. Since these pre-trained models use huge amounts of data with uncertain quality, we also analyze the effect of diacritics, a major characteristic of Yoruba, in the training data. We investigate how and when this training condition affects the final quality of a translation and its understandability. Our models outperform massively multilingual models such as Google ($+8.7$ BLEU) and Facebook M2M ($+9.1$) when translating to Yoruba, setting a high quality benchmark for future research.",
}

Contributions

Thanks to @yvonnegitau for adding this dataset.