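Dataset:
MMInstruction/M3IT-80
Project page: https://m3-it.github.io/
Translated from English into 80 languages.
The M3IT dataset compiles classical vision-language tasks, including captioning, visual question answering (VQA), visual conditioned generation, reasoning, and classification. M3IT-80 is the version of M3IT translated into 80 languages.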
_LAN_CODES = [ "af", "am", "ar", "as", "ast", "be", "bg", "bn", "bs", "ca", "ceb", "cs", "cy", "da", "de", "el", "es", "et", "fi", "fr", "fuv", "gl", "gu", "ha", "he", "hi", "hr", "hu", "hy", "id", "ig", "is", "it", "ja", "jv", "ka", "kk", "km", "kn", "ko", "ky", "lb", "lg", "lij", "li", "ln", "lo", "lt", "lv", "mi", "mk", "ml", "mr", "mt", "my", "nl", "ny", "oc", "pa", "pl", "pt", "ro", "ru", "sd", "sk", "sn", "so", "sr", "sv", "ta", "te", "tg", "th", "tl", "tr", "uk", "ur", "vi", "wo", "zh", ]
We report the number of train/validation/test instances of each dataset per language.
Task | Dataset | #Train | #Val | #Test |
---|---|---|---|---|
Classification | imagenet | 500 | 500 | 0 |
Visual Question Answering | vqa-v2 | 500 | 500 | 0 |
Knowledgeable Visual QA | okvqa | 500 | 500 | 0 |
Reasoning | winoground | 0 | 0 | 800 |
Generation | vist | 500 | 500 | 500 |
Video | msrvtt | 500 | 500 | 0 |
Video | msrvtt-qa | 500 | 500 | 0 |
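As a quick sanity check on the table above, the per-language split sizes can be summed in a few lines. This is only a sketch: `SPLIT_SIZES`, `NUM_LANGUAGES`, and `totals_per_language` are illustrative names rather than part of the dataset's tooling, the numbers are copied from the table, and the grand total assumes every one of the 80 languages has exactly these counts.

```python
# Per-language (train, validation, test) sizes copied from the table above.
# SPLIT_SIZES and totals_per_language are hypothetical names used for illustration.
SPLIT_SIZES = {
    "imagenet": (500, 500, 0),
    "vqa-v2": (500, 500, 0),
    "okvqa": (500, 500, 0),
    "winoground": (0, 0, 800),
    "vist": (500, 500, 500),
    "msrvtt": (500, 500, 0),
    "msrvtt-qa": (500, 500, 0),
}
NUM_LANGUAGES = 80  # assumption: all languages share the same split sizes


def totals_per_language(split_sizes):
    """Sum each split across all datasets for a single language."""
    return {
        "train": sum(sizes[0] for sizes in split_sizes.values()),
        "validation": sum(sizes[1] for sizes in split_sizes.values()),
        "test": sum(sizes[2] for sizes in split_sizes.values()),
    }


totals = totals_per_language(SPLIT_SIZES)
grand_total = sum(totals.values()) * NUM_LANGUAGES
print(totals, grand_total)
```

Under these figures each language has 3,000 training, 3,000 validation, and 1,300 test instances, or 584,000 instances across all 80 languages.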
Source language: English
Task | Dataset [Citation] | Source |
---|---|---|
Classification | imagenet [1] | 1239321 |
Visual Question Answering | vqa-v2 [2] | 12310321 |
Knowledgeable Visual QA | okvqa [3] | 12311321 |
Reasoning | winoground [4] | 12312321 |
Generation | vist [5] | 12313321 |
Video | msrvtt [6] | 12314321 |
Video | msrvtt-qa [7] | 12315321 |
We use the free Alibaba Translate service, a deep neural machine translation (NMT) system, to perform the translation.
    # Log in to the Hugging Face Hub (OR run `huggingface-cli login` in a shell)
    from huggingface_hub import login

    hf_token = "hf_xxx"  # TODO: set a valid HuggingFace access token for loading datasets/models
    login(token=hf_token)
    # Load one dataset configuration
    from datasets import load_dataset

    ds_name = "okvqa-zh"  # change the dataset name here
    dataset = load_dataset("MMInstruction/M3IT-80", ds_name)
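The `ds_name` value `"okvqa-zh"` used above suggests that configuration names have the form `<dataset>-<language code>`. The sketch below enumerates names under that assumption; the pattern is inferred from this single example and is not confirmed for every dataset and language, `config_name` and `all_config_names` are hypothetical helpers, and the language list is a short subset of `_LAN_CODES` for brevity.

```python
from itertools import product

# Dataset names taken from the tables in this card.
DATASETS = ["imagenet", "vqa-v2", "okvqa", "winoground", "vist", "msrvtt", "msrvtt-qa"]
# A short subset of _LAN_CODES, used here only to keep the example small.
LAN_CODES = ["af", "de", "fr", "ja", "zh"]


def config_name(dataset, lang):
    """Build a configuration name, assuming the '<dataset>-<lang>' pattern."""
    return f"{dataset}-{lang}"


def all_config_names(datasets, lan_codes):
    """List every dataset/language combination under the assumed pattern."""
    return [config_name(d, lang) for d, lang in product(datasets, lan_codes)]


names = all_config_names(DATASETS, LAN_CODES)
print(len(names), names[:3])
```

Any name produced this way should be checked against the configurations actually published on the Hub before being passed to `load_dataset`.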
    # Access the data splits
    from datasets import load_dataset

    ds_name = "okvqa-zh"  # change the dataset name here
    dataset = load_dataset("MMInstruction/M3IT-80", ds_name)

    train_set = dataset["train"]
    validation_set = dataset["validation"]
    test_set = dataset["test"]
    # Iterate over instances and decode the first image of each
    from base64 import b64decode
    from io import BytesIO

    from datasets import load_dataset
    from PIL import Image

    ds_name = "okvqa-zh"  # change the dataset name here
    dataset = load_dataset("MMInstruction/M3IT-80", ds_name)
    train_set = dataset["train"]

    for train_instance in train_set:
        instruction = train_instance["instruction"]  # str
        inputs = train_instance["inputs"]  # str
        outputs = train_instance["outputs"]  # str
        image_base64_str_list = train_instance["image_base64_str"]  # list of str (base64)
        image_0 = Image.open(BytesIO(b64decode(image_base64_str_list[0])))
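The loop above decodes each base64 string into raw bytes before handing them to PIL. The following sketch isolates that decoding step using only the standard library, so it runs without downloading the dataset. `decode_images` is a hypothetical helper, and the sample payload is just the 8-byte PNG signature followed by filler bytes, not a real image.

```python
from base64 import b64decode, b64encode
from io import BytesIO

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_images(image_base64_str_list):
    """Decode a list of base64 strings into in-memory binary streams."""
    return [BytesIO(b64decode(s)) for s in image_base64_str_list]


# Stand-in for one dataset instance; real instances hold full encoded images.
fake_instance = {
    "image_base64_str": [b64encode(PNG_SIGNATURE + b"payload").decode("ascii")]
}

streams = decode_images(fake_instance["image_base64_str"])
first_bytes = streams[0].getvalue()
print(first_bytes.startswith(PNG_SIGNATURE))
```

With real data, each returned stream can be passed to `Image.open`, as in the loop above.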
    # Data fields
    import datasets

    features = datasets.Features(
        {
            "instruction": datasets.Value("string"),
            "inputs": datasets.Value("string"),
            "image_base64_str": [datasets.Value("string")],
            "outputs": datasets.Value("string"),
        }
    )
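The schema above describes three string fields plus a list of base64 strings. A plain-Python check of that shape might look like the sketch below; `validate_instance` and `EXPECTED_STRING_FIELDS` are hypothetical names, and the check deliberately does not use the `datasets` library, so it only approximates what `datasets.Features` enforces.

```python
EXPECTED_STRING_FIELDS = ("instruction", "inputs", "outputs")


def validate_instance(instance):
    """Return a list of problems; an empty list means the shape matches."""
    problems = []
    for field in EXPECTED_STRING_FIELDS:
        if not isinstance(instance.get(field), str):
            problems.append(f"{field} should be a string")
    images = instance.get("image_base64_str")
    if not isinstance(images, list) or not all(isinstance(i, str) for i in images):
        problems.append("image_base64_str should be a list of strings")
    return problems


# Illustrative records, not taken from the dataset.
good = {
    "instruction": "Answer the question.",
    "inputs": "What is shown?",
    "outputs": "A cat.",
    "image_base64_str": ["aGVsbG8="],
}
bad = {"instruction": "x", "inputs": 3, "outputs": "y", "image_base64_str": "not-a-list"}

print(validate_instance(good))
print(validate_instance(bad))
```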
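The content of the original datasets follows their original licenses. For tasks with an unknown or custom license, we suggest that users check the original project or contact the dataset owner for detailed license information.
Our annotated instruction data is licensed under CC BY 4.0.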
    @article{li2023m3it,
      title={M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning},
      author={Lei Li and Yuwei Yin and Shicheng Li and Liang Chen and Peiyi Wang and Shuhuai Ren and Mukai Li and Yazheng Yang and Jingjing Xu and Xu Sun and Lingpeng Kong and Qi Liu},
      journal={arXiv preprint arXiv:2306.04387},
      year={2023}
    }
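M3IT-80 is the translated version of M3IT, an open-source, large-scale multi-modal, multilingual instruction tuning dataset designed to enable the development of general-purpose multi-modal agents.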