
This is the repository for the paper PromptCap: Prompt-Guided Task-Aware Image Captioning.

We introduce PromptCap, a captioning model that can be controlled with natural-language instructions. The instruction may contain a question the user is interested in, e.g. "what is the boy doing?". PromptCap also supports generic captioning, using the question "what does the image describe?"

PromptCap can serve as a lightweight visual plug-in for LLMs such as GPT-3 and ChatGPT (much faster than BLIP-2), and it also pairs well with foundation models like Segment Anything and DINO. It achieves SOTA performance on COCO captioning (150 CIDEr). When combined with GPT-3 and conditioned on the user's question, PromptCap achieves SOTA performance on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA).
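
As a sketch of how the LLM pairing works: PromptCap first turns the image into a question-aware caption, and the LLM then answers the question from that caption alone. The snippet below is a minimal illustration of this caption-then-reason loop; the OpenAI client usage, model name, and prompt wording are illustrative assumptions, not the exact setup from the paper.

import torch
from openai import OpenAI  # assumes the openai package is installed and OPENAI_API_KEY is set
from promptcap import PromptCap

model = PromptCap("vqascore/promptcap-coco-vqa")
if torch.cuda.is_available():
  model.cuda()

question = "what piece of clothing is this boy putting on?"
# Generate a caption conditioned on the user's question
caption = model.caption(
  f"please describe this image according to the given question: {question}",
  "glove_boy.jpeg",
)

# Let the LLM answer from the question-aware caption
client = OpenAI()
response = client.chat.completions.create(
  model="gpt-4o-mini",  # stand-in model name; the paper pairs PromptCap with GPT-3
  messages=[{
    "role": "user",
    "content": f"Context: {caption}\nQuestion: {question}\nAnswer:",
  }],
)
print(response.choices[0].message.content)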

QuickStart

Installation

pip install promptcap

Two pipelines are included: one for image captioning and one for visual question answering.

Captioning Pipeline

To get the best performance, please follow the prompt format used in the examples below.
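
For reference, a minimal helper that produces the expected format from a question (build_prompt is a hypothetical name, not part of the promptcap package; the template string is copied verbatim from the examples in this README):

def build_prompt(question: str) -> str:
  # Prefix the question with the fixed instruction PromptCap was trained on
  return f"please describe this image according to the given question: {question}"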

Generate a prompt-guided caption as follows:

import torch
from promptcap import PromptCap

model = PromptCap("vqascore/promptcap-coco-vqa")  # also supports OFA checkpoints, e.g. "OFA-Sys/ofa-large"

if torch.cuda.is_available():
  model.cuda()

prompt = "please describe this image according to the given question: what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"

print(model.caption(prompt, image))

To try generic captioning, just use the prompt "what does the image describe?":

prompt = "what does the image describe?"
image = "glove_boy.jpeg"

print(model.caption(prompt, image))

PromptCap also supports taking OCR text as an extra input:

prompt = "please describe this image according to the given question: what year was this taken?"
image = "dvds.jpg"
ocr = "yip AE Mht juor 02/14/2012"

print(model.caption(prompt, image, ocr))

Visual Question Answering Pipeline

Unlike typical VQA models, PromptCap is open-domain and can be paired with arbitrary text-based QA models. Here we provide a pipeline that combines PromptCap with UnifiedQA.

import torch
from promptcap import PromptCap_VQA

# the QA model supports all UnifiedQA variants, e.g. "allenai/unifiedqa-v2-t5-large-1251000"
vqa_model = PromptCap_VQA(promptcap_model="vqascore/promptcap-coco-vqa", qa_model="allenai/unifiedqa-t5-base")

if torch.cuda.is_available():
  vqa_model.cuda()

question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"

print(vqa_model.vqa(question, image))

Similarly, PromptCap supports OCR inputs for VQA:

question = "what year was this taken?"
image = "dvds.jpg"
ocr = "yip AE Mht juor 02/14/2012"

print(vqa_model.vqa(question, image, ocr=ocr))

Thanks to the flexibility of UnifiedQA, PromptCap also supports multiple-choice VQA:

question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"
choices = ["gloves", "socks", "shoes", "coats"]
print(vqa_model.vqa_multiple_choice(question, image, choices))

BibTeX

@article{hu2022promptcap,
  title={PromptCap: Prompt-Guided Task-Aware Image Captioning},
  author={Hu, Yushi and Hua, Hang and Yang, Zhengyuan and Shi, Weijia and Smith, Noah A and Luo, Jiebo},
  journal={arXiv preprint arXiv:2211.09699},
  year={2022}
}