This is the repository for the paper PromptCap: Prompt-Guided Task-Aware Image Captioning.
We introduce PromptCap, a captioning model that can be controlled by natural-language instructions. The instruction may contain a question the user is interested in, e.g., "what is the boy doing?". PromptCap also supports generic captioning via the question "what does the image describe?"
PromptCap can serve as a lightweight visual plug-in for LLMs such as GPT-3 and ChatGPT (much faster than BLIP-2), and it also pairs well with foundation models like Segment Anything and DINO. It achieves SOTA performance on COCO captioning (150 CIDEr). When combined with GPT-3 and conditioned on the user's question, PromptCap achieves SOTA results on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA); a sketch of this LLM pipeline is given after the usage examples below.
```bash
pip install promptcap
```
Two pipelines are included: one for image captioning and one for visual question answering.

Please follow the prompt formats shown below for the best performance.
Generate a prompt-guided caption as follows:
```python
import torch
from promptcap import PromptCap

model = PromptCap("vqascore/promptcap-coco-vqa")  # also supports OFA checkpoints, e.g. "OFA-Sys/ofa-large"

if torch.cuda.is_available():
    model.cuda()

prompt = "please describe this image according to the given question: what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"

print(model.caption(prompt, image))
```
To try generic captioning, use the prompt "what does the image describe?":
prompt = "what does the image describe?" image = "glove_boy.jpeg" print(model.caption(prompt, image))
PromptCap also supports OCR inputs:
prompt = "please describe this image according to the given question: what year was this taken?" image = "dvds.jpg" ocr = "yip AE Mht juor 02/14/2012" print(model.caption(prompt, image, ocr))
Unlike typical VQA models, PromptCap is open-domain and can be paired with arbitrary text QA models. Here we provide a pipeline that combines PromptCap with UnifiedQA.
```python
import torch
from promptcap import PromptCap_VQA

# The QA model supports all UnifiedQA variants, e.g. "allenai/unifiedqa-v2-t5-large-1251000"
vqa_model = PromptCap_VQA(promptcap_model="vqascore/promptcap-coco-vqa", qa_model="allenai/unifiedqa-t5-base")

if torch.cuda.is_available():
    vqa_model.cuda()

question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"

print(vqa_model.vqa(question, image))
```
Similarly, PromptCap supports OCR inputs:
question = "what year was this taken?" image = "dvds.jpg" ocr = "yip AE Mht juor 02/14/2012" print(vqa_model.vqa(question, image, ocr=ocr))
Thanks to the flexibility of UnifiedQA, PromptCap also supports multiple-choice VQA:
question = "what piece of clothing is this boy putting on?" image = "glove_boy.jpeg" choices = ["gloves", "socks", "shoes", "coats"] print(vqa_model.vqa_multiple_choice(question, image, choices))
To cite PromptCap:

```bibtex
@article{hu2022promptcap,
  title={PromptCap: Prompt-Guided Task-Aware Image Captioning},
  author={Hu, Yushi and Hua, Hang and Yang, Zhengyuan and Shi, Weijia and Smith, Noah A and Luo, Jiebo},
  journal={arXiv preprint arXiv:2211.09699},
  year={2022}
}
```