Dataset: menyo20k_mt
Task: translation
Multilinguality: translation
Size: 10K<n<100K
Language creators: found
Source datasets: original
Preprint: arxiv:2103.08647
License: cc-by-nc-4.0

MENYO-20k is a multi-domain parallel dataset with texts obtained from news articles, TED talks, movie transcripts, radio transcripts, science and technology texts, and other short articles curated from the web and by professional translators. The dataset comprises 20,100 parallel sentences, split into 10,070 training sentences, 3,397 development sentences, and 6,633 test sentences (3,419 multi-domain sentences, 1,714 news-domain sentences, and 1,500 TED-talk-domain sentences).
[More Information Needed]
The languages are English and Yoruba.
An example instance:
{'translation': {'en': 'Unit 1: What is Creative Commons?', 'yo': 'Ìdá 1: Kín ni Creative Commons?' } }
Train, validation, and test splits are provided.
[More Information Needed]
[More Information Needed]
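As a quick illustration (a minimal sketch, assuming the dataset is published on the Hugging Face Hub under the identifier menyo20k_mt and that the datasets library is installed), the splits and the translation fields can be accessed like this:

from datasets import load_dataset

# Assumed Hub identifier; adjust if the dataset is hosted under a different name.
dataset = load_dataset("menyo20k_mt")

# Shows the train/validation/test splits and their sizes.
print(dataset)

# Each example holds a "translation" dict with the English ("en") and Yoruba ("yo") sentences.
example = dataset["train"][0]
print(example["translation"]["en"])
print(example["translation"]["yo"])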
Who are the source language producers? [More Information Needed]
[More Information Needed]
Who are the annotators? [More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
The dataset is open, but for non-commercial use only, because some of the data sources, such as the TED talks and JW News, require permission for commercial use.
The dataset is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license: https://github.com/uds-lsv/menyo-20k_MT/blob/master/LICENSE
If you use this dataset, please cite the following paper:
@inproceedings{adelani-etal-2021-effect,
    title = "The Effect of Domain and Diacritics in {Y}oruba{--}{E}nglish Neural Machine Translation",
    author = "Adelani, David and Ruiter, Dana and Alabi, Jesujoba and Adebonojo, Damilola and Ayeni, Adesina and Adeyemi, Mofe and Awokoya, Ayodele Esther and Espa{\~n}a-Bonet, Cristina",
    booktitle = "Proceedings of the 18th Biennial Machine Translation Summit (Volume 1: Research Track)",
    month = aug,
    year = "2021",
    address = "Virtual",
    publisher = "Association for Machine Translation in the Americas",
    url = "https://aclanthology.org/2021.mtsummit-research.6",
    pages = "61--75",
    abstract = "Massively multilingual machine translation (MT) has shown impressive capabilities, including zero and few-shot translation between low-resource language pairs. However, these models are often evaluated on high-resource languages with the assumption that they generalize to low-resource ones. The difficulty of evaluating MT models on low-resource pairs is often due to lack of standardized evaluation datasets. In this paper, we present MENYO-20k, the first multi-domain parallel corpus with an especially curated orthography for Yoruba{--}English with standardized train-test splits for benchmarking. We provide several neural MT benchmarks and compare them to the performance of popular pre-trained (massively multilingual) MT models both for the heterogeneous test set and its subdomains. Since these pre-trained models use huge amounts of data with uncertain quality, we also analyze the effect of diacritics, a major characteristic of Yoruba, in the training data. We investigate how and when this training condition affects the final quality of a translation and its understandability. Our models outperform massively multilingual models such as Google ($+8.7$ BLEU) and Facebook M2M ($+9.1$) when translating to Yoruba, setting a high quality benchmark for future research.",
}
Thanks to @yvonnegitau for adding this dataset.