LLaVA-Critic is the first open-source large multimodal model (LMM) designed to serve as a generalist evaluator, capable of assessing performance across a wide range of multimodal tasks. It introduces a novel approach that uses AI-generated feedback to improve model alignment and self-critique, making it an important tool for developing modern multimodal systems. In this article, we explore its main capabilities, data collection process, and use cases.
LLaVA-Critic Overview
LLaVA-Critic aims to provide a scalable solution for evaluating multimodal models on tasks such as vision-language understanding. The model is trained on a high-quality dataset designed specifically for evaluation, enabling it to assess model responses and generate reward signals for preference learning. It offers two main capabilities:

- LMM-as-a-Judge: assigning reliable evaluation scores, together with written justifications, to the responses of other multimodal models.
- Preference learning: generating reward signals that can supervise preference optimization, reducing reliance on human feedback.
Why Learning to Evaluate Matters
As large multimodal models mature on pretraining data drawn from the web, there is growing interest in using AI-enhanced synthetic data to improve post-training. Reliable AI evaluation is essential for automating the assessment of complex tasks, which would otherwise be labor-intensive and costly. In particular, accurate reward signals are crucial for reinforcement learning and for guiding models at inference time.
While many models focus on improving real-world visual tasks, the role of LMMs in judging and evaluating other models remains underexplored. LLaVA-Critic addresses this gap by providing evaluation scores, along with the reasoning behind them, across a variety of tasks such as visual chat.
Key Contributions of LLaVA-Critic
LLaVA-Critic introduces several key innovations:

- A high-quality critic instruction-following dataset spanning diverse evaluation criteria and multimodal tasks.
- An open-source LMM fine-tuned to act as a generalist evaluator, producing scores, rankings, and written justifications.
- Public release of the model and its training data to support future research on scalable, AI-driven evaluation.
Data Collection Process
LLaVA-Critic's training data was generated with a GPT-assisted pipeline and covers two key evaluation settings:

- Pointwise scoring: the critic assigns a quality score to a single model response.
- Pairwise ranking: the critic compares two responses to the same input and determines which is better.

The dataset is built from widely used multimodal benchmarks and responses from off-the-shelf LMMs, ensuring comprehensive evaluation coverage across a wide range of tasks.
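To make the two settings concrete, the sketch below shows what individual training records might look like. The field names and values are hypothetical illustrations for exposition, not the actual LLaVA-Critic-113k schema:

# Hypothetical critic training records (field names are illustrative,
# not the actual LLaVA-Critic-113k schema)
pointwise_example = {
    "image": "images/example.png",
    "question": "What does this image present?",
    "response": "This is a handwritten number seven.",
    "target_score": 95,  # score the critic should predict
    "target_reason": "The response correctly identifies the digit in the image.",
}
pairwise_example = {
    "image": "images/example.png",
    "question": "What does this image present?",
    "response_a": "A black and white sketch of a cross.",
    "response_b": "This is a handwritten number seven.",
    "target_ranking": "B",  # which response the critic should prefer
    "target_reason": "Response B correctly identifies the content of the image.",
}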
Model Training and Fine-Tuning
LLaVA-Critic is fine-tuned from a pretrained LMM (specifically, the LLaVA-OneVision (OV) 7B/72B checkpoints) to develop its "critic" capability. The model is trained for one epoch on the LLaVA-Critic-113k dataset with a standard cross-entropy loss, learning to predict scores, rankings, and justifications. The final model is called LLaVA-Critic (v1.0); a variant trained on a smaller subset, called LLaVA-Critic (v0.5), was also tested.
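Because the critique is generated as ordinary text, the training objective reduces to next-token prediction over the target judgment. The following is a minimal sketch of that loss computation, assuming a Hugging Face-style causal language model interface; it illustrates the objective rather than reproducing the actual training code:

import torch

def critic_loss(model, prompt_ids, target_ids):
    # Concatenate the evaluation prompt and the target critique; mask the
    # prompt positions with -100 so the cross-entropy loss only covers the
    # target tokens (scores, rankings, and justifications).
    full_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100
    outputs = model(input_ids=full_ids, labels=labels)
    return outputs.loss  # standard token-level cross-entropy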
Scenarios and Use Cases
LLaVA-Critic is useful in the following scenarios (see the sketch after this list):

- LMM-as-a-Judge: replacing or supplementing costly human or proprietary-model evaluation by scoring and ranking the outputs of other multimodal models.
- Preference learning: supplying reward signals that drive preference optimization to improve a policy model's responses.
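As a sketch of the preference-learning use case, the critic's pairwise verdicts can label (chosen, rejected) response pairs for a preference optimizer such as DPO. The helper below is hypothetical; critic_prefers_first stands in for a call that runs the critic on a pairwise prompt and parses its verdict:

# Hypothetical sketch: turning pairwise critic verdicts into preference pairs.
# `critic_prefers_first` is an assumed callable that asks the critic to compare
# two responses and returns True if it prefers the first one.
def build_preference_pairs(samples, critic_prefers_first):
    pairs = []
    for image, question, resp_a, resp_b in samples:
        if critic_prefers_first(image, question, resp_a, resp_b):
            chosen, rejected = resp_a, resp_b
        else:
            chosen, rejected = resp_b, resp_a
        pairs.append({"image": image, "prompt": question,
                      "chosen": chosen, "rejected": rejected})
    return pairs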
Code Usage
This section shows how to run both pairwise and pointwise scoring of large multimodal model (LMM) responses in a single code snippet, using the lmms-lab/llava-critic-7b model.
Pairwise scoring compares two model responses and determines which one is better, while pointwise scoring assigns a single score to one model response.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from PIL import Image
import requests
import copy
import torch
import warnings
warnings.filterwarnings("ignore")
# Load the model and tokenizer
pretrained = "lmms-lab/llava-critic-7b"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)
model.eval()
# Download and process the image
url = "https://github.com/LLaVA-VL/blog/blob/main/2024-10-03-llava-critic/static/images/critic_img_seven.png?raw=True"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
image_sizes = [image.size]
# Define a function to handle critic prompts
def evaluate_image(critic_prompt, conv_template="qwen_1_5"):
    # Build the full prompt: image token followed by the critic instruction
    question = DEFAULT_IMAGE_TOKEN + "\n" + critic_prompt
    conv = copy.deepcopy(conv_templates[conv_template])
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    prompt_question = conv.get_prompt()
    # Tokenize the input
    input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
    # Generate the critique deterministically (greedy decoding)
    cont = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=image_sizes,
        do_sample=False,
        temperature=0,
        max_new_tokens=4096,
    )
    # Decode the generated text
    text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
    return text_outputs[0]
# Define the two critic prompts
pairwise_prompt = (
    "Given an image and a corresponding question, please serve as an unbiased and fair judge to evaluate the quality of the answers provided by a Large Multimodal Model (LMM). "
    "Determine which answer is better and explain your reasoning with specific details. Your task is provided as follows:\n"
    "Question: [What this image presents?]\n"
    "The first response: [The image is a black and white sketch of a line that appears to be in the shape of a cross. The line is a simple and straightforward representation of the cross shape, with two straight lines intersecting at a point.]\n"
    "The second response: [This is a handwritten number seven.]\n"
    "ASSISTANT:\n"
)
pointwise_prompt = (
    "Given an image and a corresponding question, please serve as an unbiased and fair judge to evaluate the quality of the answer provided by a Large Multimodal Model (LMM). "
    "Score the response out of 100 and explain your reasoning with specific details. Your task is provided as follows:\n"
    "Question: [What this image presents?]\n"
    "The LMM response: [This is a handwritten number seven.]\n"
    "ASSISTANT:\n"
)
# Run the evaluation for both pairwise and pointwise scoring
pairwise_result = evaluate_image(pairwise_prompt)
pointwise_result = evaluate_image(pointwise_prompt)
# Print both results
print("Pairwise Evaluation Result:")
print(pairwise_result)
print("\nPointwise Evaluation Result:")
print(pointwise_result)
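In practice you may want the numeric score rather than the full critique. Below is a minimal, hypothetical post-processing step, assuming the critique states the score as a number out of 100 (output phrasing can vary, so inspect the raw text if the pattern does not match):

import re

# Hypothetical post-processing: extract the first "NN/100" or "NN out of 100"
# pattern from the pointwise critique text.
match = re.search(r"(\d{1,3})\s*(?:/|out of)\s*100", pointwise_result)
score = int(match.group(1)) if match else None
print("Parsed pointwise score:", score)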
Conclusion
LLaVA-Critic demonstrates the potential of open-source LMMs to act as generalist evaluators and to provide scalable, AI-driven feedback. By releasing both the model and its data to the public, LLaVA-Critic lays the groundwork for future research into superhuman alignment mechanisms for large multimodal models. Its ability to reduce the need for human feedback while delivering high-quality evaluation and preference-learning signals makes it a powerful tool for AI developers.