
This is the repository for the paper PromptCap: Prompt-Guided Task-Aware Image Captioning.

We introduce PromptCap, a captioning model that can be controlled with natural-language instructions. The instruction may contain a question the user is interested in, e.g. "what is the boy doing?". PromptCap also supports generic captioning, using the question "what does the image describe?"

PromptCap can serve as a lightweight visual plug-in for LLMs such as GPT-3 and ChatGPT (much faster than BLIP-2), and it also pairs well with foundation models like Segment Anything and DINO. It achieves SOTA performance on COCO captioning (150 CIDEr). When combined with GPT-3 and conditioned on the user's question, PromptCap achieves SOTA performance on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA).
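
As a sketch of how the LLM pairing works: PromptCap first turns the image into a question-aware caption, and the LLM then answers the question from that caption alone. The snippet below is a minimal illustration of this caption-then-reason loop; the OpenAI client usage, model name, and prompt wording are illustrative assumptions, not the exact setup from the paper.

import torch
from openai import OpenAI  # assumes the openai package is installed and OPENAI_API_KEY is set
from promptcap import PromptCap

model = PromptCap("vqascore/promptcap-coco-vqa")
if torch.cuda.is_available():
  model.cuda()

question = "what piece of clothing is this boy putting on?"
# Generate a caption conditioned on the user's question
caption = model.caption(
  f"please describe this image according to the given question: {question}",
  "glove_boy.jpeg",
)

# Let the LLM answer from the question-aware caption
client = OpenAI()
response = client.chat.completions.create(
  model="gpt-4o-mini",  # stand-in model name; the paper pairs PromptCap with GPT-3
  messages=[{
    "role": "user",
    "content": f"Context: {caption}\nQuestion: {question}\nAnswer:",
  }],
)
print(response.choices[0].message.content)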

QuickStart

Installation

pip install promptcap

Two pipelines are included: one for image captioning and one for visual question answering.

Captioning Pipeline

To get the best performance, please follow the prompt format used in the examples below.
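
For reference, a minimal helper that produces the expected format from a question (build_prompt is a hypothetical name, not part of the promptcap package; the template string is copied verbatim from the examples in this README):

def build_prompt(question: str) -> str:
  # Prefix the question with the fixed instruction PromptCap was trained on
  return f"please describe this image according to the given question: {question}"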

Generate a prompt-guided caption as follows:

import torch
from promptcap import PromptCap

model = PromptCap("vqascore/promptcap-coco-vqa")  # also supports OFA checkpoints, e.g. "OFA-Sys/ofa-large"

if torch.cuda.is_available():
  model.cuda()

prompt = "please describe this image according to the given question: what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"

print(model.caption(prompt, image))

To try generic captioning, just use the prompt "what does the image describe?":

prompt = "what does the image describe?"
image = "glove_boy.jpeg"

print(model.caption(prompt, image))

PromptCap also supports taking OCR text as an extra input:

prompt = "please describe this image according to the given question: what year was this taken?"
image = "dvds.jpg"
ocr = "yip AE Mht juor 02/14/2012"

print(model.caption(prompt, image, ocr))

Visual Question Answering Pipeline

Unlike typical VQA models, PromptCap is open-domain and can be paired with arbitrary text-based QA models. Here we provide a pipeline that combines PromptCap with UnifiedQA.

import torch
from promptcap import PromptCap_VQA

# the QA model supports all UnifiedQA variants, e.g. "allenai/unifiedqa-v2-t5-large-1251000"
vqa_model = PromptCap_VQA(promptcap_model="vqascore/promptcap-coco-vqa", qa_model="allenai/unifiedqa-t5-base")

if torch.cuda.is_available():
  vqa_model.cuda()

question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"

print(vqa_model.vqa(question, image))

Similarly, PromptCap supports OCR inputs for VQA:

question = "what year was this taken?"
image = "dvds.jpg"
ocr = "yip AE Mht juor 02/14/2012"

print(vqa_model.vqa(question, image, ocr=ocr))

Thanks to the flexibility of UnifiedQA, PromptCap also supports multiple-choice VQA:

question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"
choices = ["gloves", "socks", "shoes", "coats"]
print(vqa_model.vqa_multiple_choice(question, image, choices))

BibTeX

@article{hu2022promptcap,
  title={PromptCap: Prompt-Guided Task-Aware Image Captioning},
  author={Hu, Yushi and Hua, Hang and Yang, Zhengyuan and Shi, Weijia and Smith, Noah A and Luo, Jiebo},
  journal={arXiv preprint arXiv:2211.09699},
  year={2022}
}