Dataset:

MMInstruction/M3IT-80

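Dataset Card for M3IT-80

Project Page: https://m3-it.github.io/

Languages

Translated from English into 80 languages.

Dataset Metainfo

The M3IT dataset compiles classical vision-language tasks, including captioning, visual question answering (VQA), visual conditioned generation, reasoning, and classification. M3IT-80 is the 80-language translated version of M3IT.

Languages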

_LAN_CODES = [
    "af", "am", "ar", "as", "ast", "be", "bg", "bn", "bs", "ca",
    "ceb", "cs", "cy", "da", "de", "el", "es", "et", "fi", "fr",
    "fuv", "gl", "gu", "ha", "he", "hi", "hr", "hu", "hy", "id",
    "ig", "is", "it", "ja", "jv", "ka", "kk", "km", "kn", "ko",
    "ky", "lb", "lg", "lij", "li", "ln", "lo", "lt", "lv", "mi",
    "mk", "ml", "mr", "mt", "my", "nl", "ny", "oc", "pa", "pl",
    "pt", "ro", "ru", "sd", "sk", "sn", "so", "sr", "sv", "ta",
    "te", "tg", "th", "tl", "tr", "uk", "ur", "vi", "wo", "zh",
]

Dataset Statistics

We report the number of train/validation/test instances of each dataset per language.

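| Task                      | Dataset    | #Train | #Val | #Test |
|---------------------------|------------|--------|------|-------|
| Classification            | imagenet   | 500    | 500  | 0     |
| Visual Question Answering | vqa-v2     | 500    | 500  | 0     |
| Knowledgeable Visual QA   | okvqa      | 500    | 500  | 0     |
| Reasoning                 | winoground | 0      | 0    | 800   |
| Generation                | vist       | 500    | 500  | 500   |
| Video                     | msrvtt     | 500    | 500  | 0     |
|                           | msrvtt-qa  | 500    | 500  | 0     |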
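
As a quick sanity check on the statistics above, the per-language totals can be summed directly. This is a minimal sketch that assumes the listed counts apply uniformly to each of the 80 languages, as the "per language" wording suggests; the names `STATS` and `totals_per_language` are illustrative and not part of the dataset.

```python
# Per-language instance counts (train, val, test), copied from the statistics table.
STATS = {
    "imagenet": (500, 500, 0),
    "vqa-v2": (500, 500, 0),
    "okvqa": (500, 500, 0),
    "winoground": (0, 0, 800),
    "vist": (500, 500, 500),
    "msrvtt": (500, 500, 0),
    "msrvtt-qa": (500, 500, 0),
}
NUM_LANGUAGES = 80  # assumption: every language has the same counts


def totals_per_language(stats):
    """Sum the (train, val, test) counts over all datasets."""
    train = sum(counts[0] for counts in stats.values())
    val = sum(counts[1] for counts in stats.values())
    test = sum(counts[2] for counts in stats.values())
    return train, val, test


train, val, test = totals_per_language(STATS)
per_language = train + val + test
print(train, val, test)              # 3000 3000 1300
print(per_language * NUM_LANGUAGES)  # 584000
```

Under that assumption, each language contributes 7,300 instances, for 584,000 translated instances overall.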

Source Data

Source language: English

| Task                      | Dataset [Citation] |
|---------------------------|--------------------|
| Classification            | imagenet [1]       |
| Visual Question Answering | vqa-v2 [2]         |
| Knowledgeable Visual QA   | okvqa [3]          |
| Reasoning                 | winoground [4]     |
| Generation                | vist [5]           |
| Video                     | msrvtt [6]         |
|                           | msrvtt-qa [7]      |

Translation

We use the free Alibaba Translate service, a neural machine translation (NMT) system, to perform the translation.

Dataset Structure

HuggingFace Login (Optional)

# OR run huggingface-cli login
from huggingface_hub import login

hf_token = "hf_xxx"  # TODO: set a valid HuggingFace access token for loading datasets/models
login(token=hf_token)

Data Loading

from datasets import load_dataset

ds_name = "okvqa-zh"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT-80", ds_name)
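
The only config name this card confirms is `okvqa-zh`. The sketch below assumes that every config follows the same `<dataset>-<language code>` pattern, which the card does not state explicitly; `DATASETS`, `LANG_CODES`, and `config_name` are illustrative names, with the values copied from the statistics table and the language list.

```python
# Dataset names from the statistics table and language codes from _LAN_CODES.
DATASETS = ["imagenet", "vqa-v2", "okvqa", "winoground", "vist", "msrvtt", "msrvtt-qa"]
LANG_CODES = [
    "af", "am", "ar", "as", "ast", "be", "bg", "bn", "bs", "ca",
    "ceb", "cs", "cy", "da", "de", "el", "es", "et", "fi", "fr",
    "fuv", "gl", "gu", "ha", "he", "hi", "hr", "hu", "hy", "id",
    "ig", "is", "it", "ja", "jv", "ka", "kk", "km", "kn", "ko",
    "ky", "lb", "lg", "lij", "li", "ln", "lo", "lt", "lv", "mi",
    "mk", "ml", "mr", "mt", "my", "nl", "ny", "oc", "pa", "pl",
    "pt", "ro", "ru", "sd", "sk", "sn", "so", "sr", "sv", "ta",
    "te", "tg", "th", "tl", "tr", "uk", "ur", "vi", "wo", "zh",
]


def config_name(dataset: str, lang: str) -> str:
    """Build a config name, ASSUMING the '<dataset>-<lang>' pattern seen in 'okvqa-zh'."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    if lang not in LANG_CODES:
        raise ValueError(f"unknown language code: {lang!r}")
    return f"{dataset}-{lang}"


all_configs = [config_name(d, lang) for d in DATASETS for lang in LANG_CODES]
print(config_name("okvqa", "zh"))  # okvqa-zh
print(len(all_configs))            # 560
```

Because the pattern is an assumption, it is worth comparing the generated names against the list returned by `datasets.get_dataset_config_names("MMInstruction/M3IT-80")` before relying on them.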

Data Splits

from datasets import load_dataset

ds_name = "okvqa-zh"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT-80", ds_name)
train_set = dataset["train"]
validation_set = dataset["validation"]
test_set = dataset["test"]  # the statistics table lists 0 test instances for okvqa, so this split may be empty or absent
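
Several datasets in the statistics table have zero instances in one or more splits, and the card does not say whether such splits are present but empty or missing altogether. The helper below is a hedged sketch that tolerates both cases; `split_sizes` is an illustrative name, and a plain dict stands in for the loaded dataset so the example runs without a download.

```python
def split_sizes(dataset, expected=("train", "validation", "test")):
    """Return {split_name: size}, reporting 0 for splits that are missing."""
    return {
        name: (len(dataset[name]) if name in dataset else 0)
        for name in expected
    }


# Stand-in for a loaded dataset: a mapping from split name to a sized collection.
fake_okvqa = {"train": [None] * 500, "validation": [None] * 500}
print(split_sizes(fake_okvqa))  # {'train': 500, 'validation': 500, 'test': 0}
```

The same function should work on the object returned by `load_dataset`, since it only relies on membership tests, indexing by split name, and `len`.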

Data Instances

from datasets import load_dataset
from io import BytesIO
from base64 import b64decode
from PIL import Image

ds_name = "okvqa-zh"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT-80", ds_name)
train_set = dataset["train"]

for train_instance in train_set:
    instruction = train_instance["instruction"]  # str
    inputs = train_instance["inputs"]  # str
    outputs = train_instance["outputs"]  # str
    image_base64_str_list = train_instance["image_base64_str"]  # list of str, each a base64-encoded image
    image_0 = Image.open(BytesIO(b64decode(image_base64_str_list[0])))
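
The loop above opens only the first image, but `image_base64_str` is a list, so an instance may carry more than one. The following sketch decodes every entry to raw bytes and guesses the format from the standard PNG and JPEG file signatures, using only the standard library. The helper names and the synthetic instance are illustrative assumptions; the card does not say which image formats the dataset actually uses.

```python
from base64 import b64decode, b64encode

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # standard 8-byte PNG signature
JPEG_MAGIC = b"\xff\xd8\xff"      # start-of-image marker used by JPEG files


def sniff_image_format(data: bytes) -> str:
    """Guess the image format from the leading bytes."""
    if data.startswith(PNG_MAGIC):
        return "png"
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    return "unknown"


def decode_images(instance: dict) -> list:
    """Decode each base64 string in `image_base64_str` into (format, raw bytes)."""
    decoded = []
    for b64_str in instance["image_base64_str"]:
        raw = b64decode(b64_str)
        decoded.append((sniff_image_format(raw), raw))
    return decoded


# Synthetic instance with fake payloads, NOT real dataset content.
fake_instance = {
    "instruction": "Describe the image.",
    "inputs": "",
    "outputs": "A placeholder answer.",
    "image_base64_str": [
        b64encode(PNG_MAGIC + b"payload").decode("ascii"),
        b64encode(JPEG_MAGIC + b"payload").decode("ascii"),
    ],
}

for fmt, raw in decode_images(fake_instance):
    print(fmt, len(raw))
```

The raw bytes returned here can be passed to `Image.open(BytesIO(raw))` exactly as in the card's own loop, or written to disk unchanged.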

Data Fields

import datasets

features = datasets.Features(
    {
        "instruction": datasets.Value("string"),
        "inputs": datasets.Value("string"),
        "image_base64_str": [datasets.Value("string")],
        "outputs": datasets.Value("string"),
    }
)
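
To illustrate what this schema implies for a single record, the sketch below checks a plain dict against the four fields without depending on the `datasets` library. `EXPECTED_FIELDS` and `validate_instance` are illustrative names, not part of the dataset or its tooling.

```python
# The three plain-string fields from the schema; image_base64_str is checked separately.
EXPECTED_FIELDS = {"instruction": str, "inputs": str, "outputs": str}


def validate_instance(instance: dict) -> list:
    """Return a list of problems; an empty list means the instance fits the schema."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in instance:
            problems.append(f"missing field: {name}")
        elif not isinstance(instance[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    images = instance.get("image_base64_str")
    if not isinstance(images, list) or not all(isinstance(s, str) for s in images):
        problems.append("image_base64_str should be a list of str")
    return problems


good = {"instruction": "a", "inputs": "", "outputs": "b", "image_base64_str": ["abc"]}
bad = {"instruction": "a", "inputs": 3, "image_base64_str": "abc"}
print(validate_instance(good))  # []
print(validate_instance(bad))
```

A check like this can be useful when building custom instances to mix with the dataset, since a bare string in `image_base64_str` is an easy mistake to make.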

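Licensing Information

The content of each original dataset follows its original license. For tasks with Unknown/Custom licenses, we suggest that users check the original project or contact the dataset owner for detailed license information.

Our annotated instruction data is licensed under CC BY 4.0.

Citation Information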

@article{li2023m3it,
  title={M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning},
  author={Lei Li and Yuwei Yin and Shicheng Li and Liang Chen and Peiyi Wang and Shuhuai Ren and Mukai Li and Yazheng Yang and Jingjing Xu and Xu Sun and Lingpeng Kong and Qi Liu},
  journal={arXiv preprint arXiv:2306.04387},
  year={2023}
}

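Contributions

M3IT-80 is the translated version of M3IT, an open-source, large-scale multi-modal, multilingual instruction tuning dataset designed to enable the development of general-purpose multi-modal agents.

References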

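  • [1] ImageNet Large Scale Visual Recognition Challenge
  • [2] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
  • [3] OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
  • [4] Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
  • [5] Visual Storytelling
  • [6] Video Question Answering via Gradually Refined Attention over Appearance and Motion
  • [7] MSR-VTT: A Large Video Description Dataset for Bridging Video and Language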