Evaluating Multimodal Models with LLaVA-Critic

Published by alex on October 22, 2024

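LLaVA-Critic is the first open-source large multimodal model (LMM) designed to act as a generalist evaluator, able to assess performance across a wide range of multimodal tasks. It introduces a new approach that uses AI-generated feedback to strengthen model alignment and self-critique, which makes it an important tool for developing modern multimodal systems. This article looks at its main capabilities, its data collection process, and its use cases.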

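LLaVA-Critic Overview

LLaVA-Critic aims to provide a scalable way to evaluate multimodal models, for example on vision-language tasks. The model is trained on a high-quality dataset built specifically for evaluation, which lets it assess model responses and generate reward signals for preference learning. It offers two main capabilities:

LMM-as-a-Judge: LLaVA-Critic produces reliable evaluation scores for multimodal tasks, comparable to those of proprietary models such as GPT-4V, which makes it a cost-effective alternative.
Preference learning: by generating AI-driven reward signals, it strengthens preference-based training and reduces reliance on expensive human feedback for model alignment.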

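Why Learning to Evaluate Matters

As large multimodal models mature on web-scale pretraining data, there is growing interest in using AI-enhanced synthetic data to improve models during post-training. Reliable AI evaluation is essential for automating the assessment of complex tasks, which would otherwise be labor-intensive and costly. In particular, accurate reward signals are essential for reinforcement learning and for guiding models at inference time.

While many models focus on improving real-world vision tasks, the role of LMMs in judging and evaluating other models remains largely unexplored. LLaVA-Critic addresses this gap by providing evaluation scores, together with the reasoning behind them, across a range of tasks such as visual chat.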

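Key Contributions of LLaVA-Critic

LLaVA-Critic introduces several key innovations:

Critic instruction-following data: it is trained on a carefully curated dataset containing more than 46,000 images and 113,000 evaluation samples. The data combines pointwise and pairwise evaluation criteria, so the model can evaluate through both quantitative judgments and detailed reasoning.
Multimodal model as a critic: LLaVA-Critic extends the capabilities of existing LMMs so that they can act as a critic, evaluating model outputs and providing feedback to improve model performance.
Open source: to support the broader AI community, the LLaVA-Critic team has released its instruction data, model checkpoints, codebase, and a visual chat demo for public use.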

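Data Collection Process

LLaVA-Critic's training data was generated with a GPT-assisted pipeline and covers two key evaluation settings:

Pointwise scoring: the model assigns a score to a single response, either by judging it directly or by comparing it with a reference answer. The dataset contains question-image pairs together with the associated responses, scores, and justifications.
Pairwise ranking: the model compares two responses and determines which is of higher quality. LLaVA-Critic is trained on many such response pairs, which lets it handle complex preference-learning tasks.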
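
To make the two settings concrete, here is a minimal sketch of how one record of each kind might be represented. The class names, field names, and the "tie" label are illustrative assumptions, not the schema of the released dataset, and the score and reason values are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PointwiseSample:
    """One pointwise record (hypothetical field names)."""
    image_path: str
    question: str
    response: str
    score: float                      # score assigned to the response
    reason: str                       # justification for the score
    reference: Optional[str] = None   # optional reference answer


@dataclass
class PairwiseSample:
    """One pairwise record (hypothetical field names)."""
    image_path: str
    question: str
    response_a: str
    response_b: str
    preferred: str                    # "a", "b", or "tie" (assumed label set)
    reason: str                       # justification for the preference

    def __post_init__(self):
        # Reject labels outside the assumed label set early.
        if self.preferred not in {"a", "b", "tie"}:
            raise ValueError(f"unexpected preference label: {self.preferred!r}")


# Placeholder values only; they are not taken from the dataset.
pointwise = PointwiseSample(
    image_path="seven.png",
    question="What this image presents?",
    response="This is a handwritten number seven.",
    score=0.0,
    reason="(placeholder)",
)
```

The optional `reference` field reflects the two pointwise modes described above: direct assessment leaves it empty, while reference-based scoring fills it in.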

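The dataset is built from widely used multimodal benchmarks and from responses produced by off-the-shelf LMMs, giving broad evaluation coverage across tasks.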

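Model Training and Fine-Tuning

LLaVA-Critic is fine-tuned from pretrained LMMs, specifically the LLaVA-OneVision (OV) 7B and 72B checkpoints, to develop its "critic" ability. The model is trained for one epoch on the LLaVA-Critic-113k dataset, using a standard cross-entropy loss to predict scores, rankings, and justifications. The final model is called LLaVA-Critic (v1.0); a variant associated with a smaller subset of the data, LLaVA-Critic (v0.5), was also tested.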
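
The "standard cross-entropy loss" mentioned here is presumably the usual next-token objective applied to the critic's target text. As a sketch in our own notation (not taken from the source), let $x$ be the image plus the evaluation prompt and $y_1, \dots, y_T$ the target tokens, that is, the score or ranking followed by the justification:

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid x,\, y_{<t}\right)
```

Under this reading, scores and rankings are not regressed separately; they are simply part of the token sequence the model learns to generate.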

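Scenarios and Use Cases

LLaVA-Critic is useful in the following scenarios:

Scenario 1, LMM-as-a-Judge: the model provides consistent evaluation scores and justifications, automating labor-intensive work such as collecting human feedback on multimodal benchmarks.
Scenario 2, preference learning: LLaVA-Critic generates reward signals that are used to optimize models through preference learning, reducing reliance on human-annotated data. It is also described as outperforming human-in-the-loop models at preference alignment, which makes it a good fit for scaling model development.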
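
One way to picture scenario 2 is to turn critic scores into (chosen, rejected) pairs that a preference-optimization method can consume. The sketch below assumes a hypothetical `score_fn` that wraps a critic call and returns a numeric score; the function name, the `min_margin` parameter, and the best-versus-worst selection rule are our assumptions, not the authors' documented procedure.

```python
from typing import Callable, List, Optional, Tuple


def build_preference_pair(
    responses: List[str],
    score_fn: Callable[[str], float],
    min_margin: float = 0.0,
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) from critic scores, or None if the gap is too small."""
    if len(responses) < 2:
        return None
    # Score every candidate once, then sort from lowest to highest score.
    scored = sorted(((score_fn(r), r) for r in responses), key=lambda p: p[0])
    (low_score, rejected), (high_score, chosen) = scored[0], scored[-1]
    # Skip pairs whose scores are too close to carry a useful signal.
    if high_score - low_score <= min_margin:
        return None
    return chosen, rejected


# Toy scores standing in for critic outputs (made-up numbers).
toy_scores = {"a cross": 20.0, "a handwritten seven": 90.0}
pair = build_preference_pair(list(toy_scores), toy_scores.get)
print(pair)  # ('a handwritten seven', 'a cross')
```

In practice `score_fn` would call the critic with a pointwise prompt; a pairwise prompt could be used instead to pick the winner directly.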

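Code Usage

This section shows how a single code snippet can score large multimodal model (LMM) responses both pairwise and pointwise, using the lmms-lab/llava-critic-7b model.

Pairwise scoring compares two model responses and decides which one is better, while pointwise scoring assigns a single score to one model response.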

from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from PIL import Image
import requests
import copy
import torch
import warnings
warnings.filterwarnings("ignore")
# Load the model and tokenizer
pretrained = "lmms-lab/llava-critic-7b"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)
model.eval()
# Download and process the image
url = "https://github.com/LLaVA-VL/blog/blob/main/2024-10-03-llava-critic/static/images/critic_img_seven.png?raw=True"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
image_sizes = [image.size]
# Define a function to handle critic prompts
def evaluate_image(critic_prompt, conv_template="qwen_1_5"):
    # Generate the full prompt with the image token and critic prompt
    question = DEFAULT_IMAGE_TOKEN + "\n" + critic_prompt
    conv = copy.deepcopy(conv_templates[conv_template])
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    prompt_question = conv.get_prompt()
    # Tokenize the input
    input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
    # Generate response
    cont = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=image_sizes,
        do_sample=False,
        temperature=0,
        max_new_tokens=4096,
    )
    # Decode the generated text
    text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
    return text_outputs[0]
# Define the two critic prompts
pairwise_prompt = (
    "Given an image and a corresponding question, please serve as an unbiased and fair judge to evaluate the quality of the answers provided by a Large Multimodal Model (LMM). "
    "Determine which answer is better and explain your reasoning with specific details. Your task is provided as follows:\n"
    "Question: [What this image presents?]\n"
    "The first response: [The image is a black and white sketch of a line that appears to be in the shape of a cross. The line is a simple and straightforward representation of the cross shape, with two straight lines intersecting at a point.]\n"
    "The second response: [This is a handwritten number seven.]\n"
    "ASSISTANT:\n"
)
pointwise_prompt = (
    "Given an image and a corresponding question, please serve as an unbiased and fair judge to evaluate the quality of the answer provided by a Large Multimodal Model (LMM). "
    "Score the response out of 100 and explain your reasoning with specific details. Your task is provided as follows:\n"
    "Question: [What this image presents?]\n"
    "The LMM response: [This is a handwritten number seven.]\n"
    "ASSISTANT:\n"
)
# Run the evaluation for both pairwise and pointwise scoring
pairwise_result = evaluate_image(pairwise_prompt)
pointwise_result = evaluate_image(pointwise_prompt)
# Print both results
print("Pairwise Evaluation Result:")
print(pairwise_result)
print("\nPointwise Evaluation Result:")
print(pointwise_result)
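
The snippet above prints free-form text, so using the pointwise result as a number requires parsing. The article does not specify the critic's output format, so the helper below is only a sketch: `extract_score` is a hypothetical name, and the two regular expressions are guesses at common phrasings ("85/100", "85 out of 100", "Score: 85").

```python
import re
from typing import Optional

# Assumed phrasings; adjust after inspecting real critic outputs.
_SCORE_PATTERNS = [
    re.compile(r"(\d+(?:\.\d+)?)\s*(?:/|out of)\s*100\b"),
    re.compile(r"[Ss]core\s*[:=]?\s*(\d+(?:\.\d+)?)"),
]


def extract_score(critic_text: str) -> Optional[float]:
    """Return a score in [0, 100] found in critic_text, or None if none is found."""
    for pattern in _SCORE_PATTERNS:
        match = pattern.search(critic_text)
        if match is None:
            continue
        value = float(match.group(1))
        # Ignore numbers outside the range requested by the pointwise prompt.
        if 0.0 <= value <= 100.0:
            return value
    return None
```

Returning `None` instead of a default keeps unparseable outputs visible, so they can be logged or re-queried rather than silently counted as a score.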

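Conclusion

LLaVA-Critic demonstrates the potential of open-source LMMs to serve as generalist evaluators and to provide scalable, AI-driven feedback. By releasing the model and its data publicly, LLaVA-Critic lays groundwork for future research on superhuman alignment mechanisms for large multimodal models. Its ability to reduce the need for human feedback while still providing high-quality evaluations and preference-learning signals makes it a powerful tool for AI developers.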

Source: https://medium.com/@gautam75/evaluating-multimodal-models-with-llava-critic-9f0dc22f65dd
