Dataset:
turuta/Multi30k-uk
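The Multi30K dataset is designed to support multilingual and multimodal research.
Originally, it extended the Flickr30K dataset by adding German translations. The image descriptions were collected from a crowdsourcing platform, while the translations were produced by professionally contracted translators.
We provide a variant of this dataset that has been manually translated into Ukrainian.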
Paper:
@inproceedings{saichyshyna-etal-2023-extension,
    title = "Extension {M}ulti30{K}: Multimodal Dataset for Integrated Vision and Language Research in {U}krainian",
    author = "Saichyshyna, Nataliia and Maksymenko, Daniil and Turuta, Oleksii and Yerokhin, Andriy and Babii, Andrii and Turuta, Olena",
    booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.unlp-1.7",
    pages = "54--61",
    abstract = "We share the results of the project within the well-known Multi30k dataset dedicated to improving machine translation of text from English into Ukrainian. The main task was to manually prepare the dataset and improve the translation of texts. The importance of collecting such datasets for low-resource languages for improving the quality of machine translation has been discussed. We also studied the features of translations of words and sentences with ambiguous meanings. The collection of multimodal datasets is essential for natural language processing tasks because it allows the development of more complex and comprehensive machine learning models that can understand and analyze different types of data. These models can learn from a variety of data types, including images, text, and audio, for more accurate and meaningful results.",
}