Project page: M3IT
Languages: English and Chinese. Translations into 80 languages are available in M3IT-80.
Our dataset compiles diverse classic vision-language tasks, including captioning, visual question answering (VQA), visually conditioned generation, reasoning, and classification.
Task | #Instructions |
---|---|
Image Captioning | 52 |
Classification | 113 |
Visual Question Answering | 95 |
Knowledgeable Visual QA | 40 |
Reasoning | 60 |
Generation | 40 |
Total | 400 |
Task | Description | #Train | #Val | #Test |
---|---|---|---|---|
Image Captioning | Given an image, write a description for the image. | 679,087 | 41,462 | 27,499 |
Classification | Given an image, classify the image into pre-defined categories. | 238,303 | 100,069 | 21,206 |
Visual Question Answering | Given an image, answer a question relevant to the image. | 177,633 | 46,314 | 10,828 |
Knowledgeable Visual QA | Given an image, answer a question that requires outside knowledge. | 39,981 | 11,682 | 5,477 |
Reasoning | Given an image, conduct reasoning over the image. | 99,372 | 11,500 | 10,000 |
Generation | Given an image, write a composition meeting the given requirements. | 145,000 | 11,315 | 17,350 |
Chinese | CAP, CLS, VQA, and GEN tasks in Chinese. | 192,076 | 77,306 | 4,100 |
Video | CAP, CLS, and VQA tasks on video-language datasets. | 20,868 | 7,542 | 9,294 |
Multi-lingual | Translated tasks in 80 languages. | 0 | 240,000 | 184,000 |
| Task | Dataset | #Train | #Val | #Test |
|---|---|---|---|---|
| Image Captioning | coco | 566,747 | 25,010 | 25,010 |
| | textcap | 97,765 | 13,965 | 0 |
| | image-paragraph-captioning | 14,575 | 2,487 | 2,489 |
| Classification | coco-goi | 30,000 | 2,000 | 0 |
| | coco-text | 118,312 | 27,550 | 0 |
| | imagenet | 30,000 | 50,000 | 0 |
| | coco-itm | 30,000 | 5,000 | 5,000 |
| | snli-ve | 20,000 | 14,339 | 14,740 |
| | mocheg | 4,991 | 180 | 466 |
| | iqa | 5,000 | 1,000 | 1,000 |
| Visual Question Answering | vqa-v2 | 30,000 | 30,000 | 0 |
| | shapes | 13,568 | 1,024 | 1,024 |
| | docvqa | 39,463 | 5,349 | 0 |
| | ocr-vqa | 11,414 | 4,940 | 0 |
| | st-vqa | 26,074 | 0 | 4,070 |
| | text-vqa | 27,113 | 0 | 5,734 |
| | gqa | 30,001 | 5,001 | 0 |
| Knowledgeable Visual QA | okvqa | 9,009 | 5,046 | 0 |
| | a-okvqa | 17,056 | 1,145 | 0 |
| | science-qa | 12,726 | 4,241 | 4,241 |
| | viquae | 1,190 | 1,250 | 1,236 |
| Reasoning | clevr | 30,000 | 2,000 | 0 |
| | nlvr | 29,372 | 2,000 | 0 |
| | vcr | 25,000 | 5,000 | 5,000 |
| | visual-mrc | 15,000 | 2,500 | 5,000 |
| | winoground | 0 | 0 | 800 |
| Generation | vist | 5,000 | 4,315 | 4,350 |
| | visual-dialog | 50,000 | 1,000 | 1,000 |
| | multi30k | 90,000 | 6,000 | 12,000 |
| Chinese | fm-iqa | 164,735 | 75,206 | 0 |
| | coco-cn | 18,341 | 1,000 | 1,000 |
| | flickr8k-cn | 6,000 | 1,000 | 1,000 |
| | chinese-food | 0 | 0 | 1,100 |
| | mmchat | 3,000 | 1,000 | 1,000 |
| Video | ss | 2,000 | 2,000 | 2,000 |
| | ivqa | 5,994 | 2,000 | 2,000 |
| | msvd-qa | 1,161 | 245 | 504 |
| | activitynet-qa | 3,200 | 1,800 | 800 |
| | msrvtt | 6,513 | 497 | 2,990 |
| | msrvtt-qa | 2,000 | 1,000 | 1,000 |
```python
# OR run `huggingface-cli login`
from huggingface_hub import login

hf_token = "hf_xxx"  # TODO: set a valid HuggingFace access token for loading datasets/models
login(token=hf_token)
```
```python
from datasets import load_dataset

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
```
```python
from datasets import load_dataset

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
train_set = dataset["train"]
validation_set = dataset["validation"]
test_set = dataset["test"]
```
```python
from base64 import b64decode
from io import BytesIO

from datasets import load_dataset
from PIL import Image

ds_name = "coco"  # change the dataset name here
dataset = load_dataset("MMInstruction/M3IT", ds_name)
train_set = dataset["train"]

for train_instance in train_set:
    instruction = train_instance["instruction"]  # str
    inputs = train_instance["inputs"]  # str
    outputs = train_instance["outputs"]  # str
    image_base64_str_list = train_instance["image_base64_str"]  # list of str (base64)
    image_0 = Image.open(BytesIO(b64decode(image_base64_str_list[0])))
```
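For instruction tuning, the text fields above are typically concatenated into a single training string. The template below is only an illustration (M3IT does not prescribe a prompt format, and `build_prompt` is our own helper); it shows one common way to combine the fields:

```python
def build_prompt(instance):
    """Combine instruction, inputs, and outputs into one training string.

    The "Instruction/Input/Response" template is a common convention,
    not something mandated by the M3IT dataset itself.
    """
    parts = [f"Instruction: {instance['instruction']}"]
    if instance["inputs"]:  # "inputs" may be an empty string for some tasks
        parts.append(f"Input: {instance['inputs']}")
    parts.append(f"Response: {instance['outputs']}")
    return "\n".join(parts)


example = {
    "instruction": "Answer the question about the image.",
    "inputs": "What color is the bus?",
    "outputs": "Red.",
}
print(build_prompt(example))
```

When `inputs` is empty (e.g. plain captioning instances), the `Input:` line is simply omitted.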
```python
import datasets

features = datasets.Features(
    {
        "instruction": datasets.Value("string"),
        "inputs": datasets.Value("string"),
        "image_base64_str": [datasets.Value("string")],
        "outputs": datasets.Value("string"),
    }
)
```
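The schema above stores each image as a base64-encoded string. As a minimal sketch of producing a record in that shape from raw image bytes (the `make_record` helper is our own, not part of the dataset tooling):

```python
from base64 import b64decode, b64encode


def make_record(instruction, inputs, outputs, image_bytes_list):
    """Assemble a dict matching the M3IT feature schema.

    Each image is stored as a base64-encoded string, one entry per image.
    """
    return {
        "instruction": instruction,
        "inputs": inputs,
        "outputs": outputs,
        "image_base64_str": [b64encode(b).decode("ascii") for b in image_bytes_list],
    }


record = make_record(
    "Describe the image.",
    "",
    "A cat on a mat.",
    [b"\x89PNG fake bytes"],  # placeholder bytes; a real record holds PNG/JPEG data
)
# Round-trip check: decoding recovers the original bytes.
assert b64decode(record["image_base64_str"][0]) == b"\x89PNG fake bytes"
```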
[More Information Needed]
| Task | Dataset [Citation] |
|---|---|
| Image Captioning | coco [1] |
| | textcap [2] |
| | image-paragraph-captioning [3] |
| Classification | coco-goi [1] |
| | coco-text [4] |
| | imagenet [5] |
| | coco-itm [1] |
| | snli-ve [6] |
| | mocheg [7] |
| | iqa [8] |
| Visual Question Answering | vqa-v2 [9] |
| | shapes [10] |
| | docvqa [11] |
| | ocr-vqa [12] |
| | st-vqa [13] |
| | text-vqa [14] |
| | gqa [15] |
| Knowledgeable Visual QA | okvqa [16] |
| | a-okvqa [17] |
| | science-qa [18] |
| | viquae [19] |
| Reasoning | clevr [20] |
| | nlvr [21] |
| | vcr [22] |
| | visual-mrc [23] |
| | winoground [24] |
| Generation | vist [25] |
| | visual-dialog [26] |
| | multi30k [27] |
| Chinese | fm-iqa [28] |
| | coco-cn [29] |
| | flickr8k-cn [30] |
| | chinese-food [31] |
| | mmchat [32] |
| Video | ss [33] |
| | ivqa [34] |
| | msvd-qa [35] |
| | activitynet-qa [36] |
| | msrvtt [35] |
| | msrvtt-qa [37] |
To build a high-quality multimodal instruction dataset, we rewrote various datasets into a multimodal-to-text dialogue format. The annotation process consists of four steps:
The eight authors of this work served as the human annotators; each is a graduate student familiar with the relevant literature.
The content of the original datasets follows their original license agreements. For tasks with unknown or custom licenses, we suggest that users check the original project or contact the dataset owners for detailed licensing information.
Our annotated instruction data is licensed under CC BY 4.0.
```bibtex
@article{li2023m3it,
  title={M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning},
  author={Lei Li and Yuwei Yin and Shicheng Li and Liang Chen and Peiyi Wang and Shuhuai Ren and Mukai Li and Yazheng Yang and Jingjing Xu and Xu Sun and Lingpeng Kong and Qi Liu},
  journal={arXiv preprint arXiv:2306.04387},
  year={2023}
}
```
M3IT is an open-source, large-scale multimodal, multilingual instruction tuning dataset, designed to enable the development of general-purpose multimodal agents.