Efficient Fine-Tuning of Vision-Language Models with LoRA

Published by alex on December 2, 2024

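Introduction

Vision-language models (VLMs) are becoming central to artificial intelligence because they let a system process and understand images and text together across a wide range of tasks. Key applications include image captioning, visual question answering, document understanding, and OCR (reading text from images). These tasks matter in industries that depend on high-quality visual and textual analysis, such as e-commerce, healthcare, and finance. A range of capable open-source multimodal models is now available, in particular VLMs that handle a broad set of tasks. Some of the leading models are: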

LLaVA and LLaVA-NeXT
Pixtral-12B
Llama-3.2-11B-Vision
Qwen2-VL
Molmo

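These pretrained multimodal models usually provide a solid foundation and perform well on general tasks such as image captioning and document understanding, because they have already learned patterns from large, diverse datasets. Fine-tuning them on your own dataset can still improve performance on more specialized applications. Scenarios where fine-tuning becomes essential include:

Domain adaptation
Task specialization
Resource optimization
Cultural and regional context

By aligning the model more closely with the specific requirements of a task, fine-tuning can improve both accuracy and efficiency in the target application.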

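In this article, we explore how to fine-tune Meta AI's Llama-3.2-11B-Vision model with a set of powerful tools. We use Unsloth for efficient model loading and training, LoRA for parameter-efficient updates, and Weights & Biases (W&B) for experiment tracking. Once fine-tuning is complete, the model can be served with vLLM for high-performance inference.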

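Overview of Tools and Techniques

Unsloth:

An optimized framework for fine-tuning vision-language models (VLMs) and large language models (LLMs), reported to deliver up to 30x faster training with 60% lower memory usage.
Supports a range of hardware configurations, including NVIDIA, AMD, and Intel GPUs, and applies weight-optimization techniques to improve memory efficiency.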

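LoRA (Low-Rank Adaptation):

An efficient fine-tuning technique that avoids updating all of the model's parameters.
Adds small trainable low-rank matrices to the model's layers to adapt it to a specific task, while the original weights stay frozen.
Lowers GPU memory requirements, making fine-tuning feasible on standard hardware.
A good choice for balancing resource efficiency against fine-tuning performance.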
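
The idea behind these points can be sketched in a few lines. The following is a minimal, dependency-free illustration, not the PEFT or Unsloth implementation: a frozen weight matrix W receives a low-rank update B @ A scaled by alpha / r, and only the two small matrices A and B would be trained. The matrix sizes and the helper names matmul and lora_effective_weight are assumptions made for this example.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the weight the adapted layer effectively uses."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

random.seed(0)
d_out, d_in, r, alpha = 4, 6, 2, 4  # toy sizes, chosen only for illustration

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, d_out x r, zero-initialized

W_eff = lora_effective_weight(W, A, B, alpha, r)

frozen = d_out * d_in            # parameters in W, never updated
trainable = r * (d_in + d_out)   # parameters in A and B
print(f"unchanged at start: {W_eff == W}")
print(f"frozen parameters: {frozen}, trainable parameters: {trainable}")
```

Because B starts at zero, the adapted layer initially behaves exactly like the base layer, and training only moves A and B. At this toy size the saving is small; it becomes substantial for the large projection matrices in a real model.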

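Weights & Biases (W&B):

A tracking tool for monitoring training metrics, managing experiments, and visualizing performance.
Supports reproducibility and collaboration across teams.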

Step-by-Step Guide: Fine-Tuning and Deployment

Installing the Required Libraries

!pip install torch==2.5.1 transformers==4.46.2 datasets wandb huggingface_hub python-dotenv --no-cache-dir | tail -n 1 
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps trl peft accelerate bitsandbytes xformers==0.0.28.post3 --no-cache | tail -n 1

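Setting Up Weights & Biases

To monitor the fine-tuning process and record different experiments, we use Weights & Biases (W&B). It automatically logs training progress, including loss curves, and lets you visualize metrics, compare model versions, and track model performance over time.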

import os
import wandb
from dotenv import load_dotenv
load_dotenv()
def setup_wandb(project_name: str, run_name: str):
    # Read the API key from the environment (os.getenv returns None when it is unset)
    api_key = os.getenv("WANDB_API_KEY")
    if api_key is None:
        raise EnvironmentError("WANDB_API_KEY is not set in the environment variables.")
    try:
        wandb.login(key=api_key)
        print("Successfully logged into WandB.")
    except Exception as e:
        print(f"Error logging into WandB: {e}")
    
    # Optional: Log models
    os.environ["WANDB_LOG_MODEL"] = "checkpoint"
    os.environ["WANDB_WATCH"] = "all"
    os.environ["WANDB_SILENT"] = "true"
    
    # Initialize the WandB run
    try:
        wandb.init(project=project_name, name=run_name)
        print(f"WandB run initialized: Project - {project_name}, Run - {run_name}")
    except Exception as e:
        print(f"Error initializing WandB run: {e}")
# Setup Weights & Biases
setup_wandb(project_name="<project_name>", run_name="<run_name>")

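Hugging Face Authentication

After fine-tuning, we upload the model to the Hugging Face Hub. To do that, we first retrieve and validate our Hugging Face token. The token grants permission to upload models and interact with Hugging Face resources. The upload itself is covered later in this article.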

from huggingface_hub import login
hf_token = os.getenv("HUGGINGFACE_TOKEN")
if hf_token is None:
    raise EnvironmentError("HUGGINGFACE_TOKEN is not set in the environment variables.")
login(hf_token)

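Preparing the Training Dataset

For training, we use the HuggingFaceM4/the_cauldron dataset, specifically its geomverse subset. This subset is designed for multimodal tasks involving geometric problem solving and mathematical reasoning over images and text. Each sample contains an image illustrating a geometry problem, together with a textual description of the problem and a step-by-step solution, which makes it well suited to multimodal models that must combine image and text to solve a problem.

To keep fine-tuning efficient, we select a subset of 3,000 samples instead of using the full training split. This lets us evaluate the model's performance quickly while reducing compute cost.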

from datasets import load_dataset
from PIL import Image
# Loading the dataset
dataset_id = "HuggingFaceM4/the_cauldron"
subset = "geomverse"
dataset = load_dataset(dataset_id, subset, split="train")
# Selecting a subset of 3K samples for fine-tuning
dataset = dataset.select(range(3000))
print(f"Using a sample size of {len(dataset)} for fine-tuning.")
print(dataset)

Now let's look at the sample at index 5 to inspect its text and image content.

dataset[5]

To better understand the image data in our dataset, we can inspect its properties. The following retrieves the image's mode, size, and type:

# Print the mode of the image in dataset[5]
print(f"Image Mode: {dataset[5]['images'][0].mode}")
# Print the size of the image in dataset[5]
print(f"Image Size: {dataset[5]['images'][0].size}")
# Print the type of the image in dataset[5]
print(f"Image Type: {type(dataset[5]['images'][0])}")
# Display the image - dataset[5]["images"][0].show()
print("Displaying the Image:")
small_image = dataset[5]["images"][0].copy()  # Create a copy to avoid modifying the original
small_image.thumbnail((400, 400))             # Resize to fit within 400x400 pixels
small_image.show()

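In this section, we define utility functions that preprocess the images to make training cheaper and that structure the dataset in the correct format:

convert_to_rgb:

Ensures that every image is in RGB format.
Handles the alpha channel by compositing the image onto a white background.

reduce_image_size:

Scales each image down to save memory and compute; the default scale of 0.5 halves the width and height, which leaves a quarter of the pixels.

format_data:

Structures the dataset by combining text and image data.
Organizes each sample into "user" and "assistant" roles.
Prepares the dataset for fine-tuning a conversational model on multimodal tasks.

This structure lets a model that takes both text and image inputs be trained without further conversion.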

def convert_to_rgb(image):
    """Convert image to RGB format if not already in RGB."""
    if image.mode == "RGB":
        return image
    image_rgba = image.convert("RGBA")
    background = Image.new("RGBA", image_rgba.size, (255, 255, 255))
    alpha_composite = Image.alpha_composite(background, image_rgba)
    return alpha_composite.convert("RGB")

def reduce_image_size(image, scale=0.5):
    """Reduce image size by a given scale."""
    original_width, original_height = image.size
    new_width = int(original_width * scale)
    new_height = int(original_height * scale)
    return image.resize((new_width, new_height))

def format_data(sample):
    """Format the dataset sample into structured messages."""
    image = sample["images"][0]
    image = convert_to_rgb(image)  
    image = reduce_image_size(image)
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": sample["texts"][0]["user"],
                    },
                    {
                        "type": "image",
                        "image": image,  
                    }
                ],
            },
            {
                "role": "assistant",
                "content": [
                    {
                        "type": "text",
                        "text": sample["texts"][0]["assistant"],
                    }
                ],
            },
        ],
    }

# Transform the dataset
converted_dataset = [format_data(sample) for sample in dataset]

Now let's see what the data looks like after applying the transformations above.

converted_dataset[5]

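Loading the Vision Model

This setup initializes the Llama-3.2-11B-Vision-Instruct model with Unsloth's FastVisionModel, using the following parameters:

Gradient checkpointing (use_gradient_checkpointing="unsloth"): significantly reduces memory usage, which is especially useful for long-context sequences.
Quantization (load_in_4bit=False): keeps the weights in 16-bit precision (plain LoRA) for better accuracy; it can be set to True for 4-bit quantization (QLoRA) to save memory.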

import torch
from unsloth import FastVisionModel 
model_name = "unsloth/Llama-3.2-11B-Vision-Instruct"
model, tokenizer = FastVisionModel.from_pretrained(
    model_name = model_name,
    load_in_4bit = False,                     # Use 4bit to reduce memory use. False for 16bit LoRA.
    use_gradient_checkpointing = "unsloth",   # True or "unsloth" for long context
)

You can use other models as well, for example:

unsloth/Qwen2-VL-7B-Instruct
unsloth/Pixtral-12B-2409
unsloth/llava-v1.6-mistral-7b-hf

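Configuring LoRA for Parameter-Efficient Fine-Tuning

In this section, we configure LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning. Training focuses on small adapters attached to key parts of the model instead of updating every parameter, which speeds up training and reduces memory usage. The key parameters are:

finetune_vision_layers=True: enables fine-tuning of the vision layers so the model can adapt to the visual side of the task.
finetune_language_layers=True: enables fine-tuning of the language layers so the model can adapt to the language side of the task.
finetune_attention_modules=True: enables fine-tuning of the attention layers, which help the model focus on the important parts of the input sequence.
finetune_mlp_modules=True: enables fine-tuning of the MLP layers, which transform the model's internal representations.
r=8: sets the rank of the LoRA matrices; a higher rank gives the adapters more capacity at the cost of more memory.
lora_alpha=16: the LoRA scaling factor, which controls how strongly the low-rank update contributes to the effective weights.
lora_dropout=0: applies no dropout to the LoRA layers during training, so no extra randomness is introduced.
bias="none": trains no additional bias terms during fine-tuning.
random_state=3407: fixes the random seed so that training is reproducible.
use_rslora=False: disables rank-stabilized LoRA (rsLoRA) and uses the standard LoRA scaling of lora_alpha / r; rsLoRA scales by lora_alpha / sqrt(r) instead, which is intended to keep training stable at higher ranks.
loftq_config=None: disables LoftQ, an initialization method for LoRA on quantized models that can improve accuracy but increases memory usage at the start of training.

This configuration fine-tunes the selected layers efficiently, adapting the model to the task while keeping compute overhead low.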
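
To make the roles of r and lora_alpha concrete, here is a small arithmetic sketch. It is not taken from the original article: the 4096 x 4096 layer size is a hypothetical example, and lora_param_count and lora_scale are helper names invented for illustration.

```python
import math

def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)

def lora_scale(alpha: float, r: int, rank_stabilized: bool = False) -> float:
    """Multiplier applied to the low-rank update (standard LoRA or rsLoRA)."""
    return alpha / math.sqrt(r) if rank_stabilized else alpha / r

# Hypothetical layer size, chosen for illustration only.
d_in = d_out = 4096
r, alpha = 8, 16  # the values used in the configuration

full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
standard_scale = lora_scale(alpha, r)
rslora_scale = lora_scale(alpha, r, rank_stabilized=True)

print(f"full matrix:  {full:,} parameters")
print(f"LoRA adapter: {added:,} parameters ({100 * added / full:.2f}% of the matrix)")
print(f"standard scale: {standard_scale}")
print(f"rsLoRA scale:   {rslora_scale:.3f}")
```

With r=8 and lora_alpha=16 the standard scale is 2.0. Doubling r while keeping lora_alpha fixed would halve that multiplier, which is why the two values are usually tuned together.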

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True, # False if not finetuning vision layers
    finetune_language_layers   = True, # False if not finetuning language layers
    finetune_attention_modules = True, # False if not finetuning attention layers
    finetune_mlp_modules       = True, # False if not finetuning MLP layers
    r = 8,           
    lora_alpha = 16,  
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
    use_rslora = False,  
    loftq_config = None
)

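Evaluating the Base Vision Model

Before fine-tuning, let's first check how the original model performs. We use the TextStreamer class to stream the generated text, so the response appears token by token as it is produced.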

FastVisionModel.for_inference(model)         # Enable for inference!
image = dataset[5]["images"][0]
instruction = dataset[5]["texts"][0]["user"]
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {
                "type": "text",
                "text": instruction
            },
        ]
    }
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1024,
                   use_cache = True, temperature = 1.5, min_p = 0.1)
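
The generate call above samples with temperature=1.5 and min_p=0.1. As a rough illustration of what these two settings do to a next-token distribution, the sketch below applies temperature scaling followed by min-p filtering to a toy list of logits. It is a simplified stand-in written for this article, not the transformers implementation; sample_filter and the toy logits are assumptions.

```python
import math

def sample_filter(logits, temperature=1.5, min_p=0.1):
    """Apply temperature scaling, then keep tokens whose probability is at least
    min_p times the probability of the most likely token. Returns the kept
    indices and their renormalized probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    threshold = min_p * max(probs)
    kept = [i for i, p in enumerate(probs) if p >= threshold]
    kept_total = sum(probs[i] for i in kept)
    return kept, [probs[i] / kept_total for i in kept]

toy_logits = [2.0, 1.0, 0.0, -3.0]  # made-up scores for four candidate tokens
kept, kept_probs = sample_filter(toy_logits)
print("kept token indices:", kept)
print("renormalized probabilities:", [round(p, 3) for p in kept_probs])
```

A temperature above 1 flattens the distribution and makes output more varied, while min_p removes the unlikely tail relative to the top candidate. For a reasoning task like this one, a lower temperature may give more stable answers.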

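You will find that although the model interprets the shape correctly, its mathematical and geometric reasoning is wrong, which leads to an overly long and incorrect answer with some signs of hallucination.

Training with SFTTrainer and Unsloth

The code below configures and starts training of the vision model with SFTTrainer from the trl library. It first sets the hyperparameters through SFTConfig, including batch size, learning rate, and optimizer settings, and enables mixed-precision training according to what the hardware supports. It then prepares the model for training with FastVisionModel.for_training(). The trainer is initialized with the model, the tokenizer, and a custom data collator (UnslothVisionDataCollator) needed for vision fine-tuning. This setup covers training, resource management, and logging for multimodal tasks, in particular for vision-based models.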

from trl import SFTTrainer, SFTConfig
from unsloth import is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
args = SFTConfig(
        per_device_train_batch_size = 2, # Controls the batch size per device
        gradient_accumulation_steps = 4, # Accumulates gradients to simulate a larger batch
        warmup_steps = 5,
        num_train_epochs = 3,            # Number of training epochs
        learning_rate = 2e-4,            # Sets the learning rate for optimization
        fp16 = not is_bf16_supported(),
        bf16 = is_bf16_supported(),
        optim = "adamw_8bit",
        weight_decay = 0.01,            # Regularization term for preventing overfitting
        lr_scheduler_type = "linear",   # Chooses a linear learning rate decay
        seed = 3407,
        output_dir = "outputs",
        report_to = "wandb",            # Enables WandB logging
        logging_steps = 1,              # Sets frequency of logging 
        logging_strategy = "steps",
        save_strategy = "no",
        load_best_model_at_end = True,
        save_only_model = False,
        # You MUST put the below items for vision finetuning:
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        dataset_num_proc = 4,
        max_seq_length = 2048,
    )

FastVisionModel.for_training(model)    # Enable for training!
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    data_collator = UnslothVisionDataCollator(model, tokenizer), # Must use!
    train_dataset = converted_dataset,
    args = args,
)
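
As a quick sanity check on these hyperparameters, the sketch below estimates the effective batch size, the number of optimizer steps, and the learning rate at a few points of a linear schedule with warmup. It assumes a single GPU and mirrors the usual linear warmup and decay shape only approximately; total_optimizer_steps and linear_lr are helper names invented for this example, not trl or transformers functions.

```python
import math

def total_optimizer_steps(num_samples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Approximate number of optimizer updates over the whole run."""
    batches_per_epoch = math.ceil(num_samples / (per_device_batch * num_devices))
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs

def linear_lr(step, total_steps, base_lr=2e-4, warmup_steps=5):
    """Linear warmup to base_lr, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

effective_batch = 2 * 4  # per_device_train_batch_size * gradient_accumulation_steps
total_steps = total_optimizer_steps(num_samples=3000, per_device_batch=2, grad_accum=4, epochs=3)

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {total_steps}")
for step in (0, 5, total_steps // 2, total_steps):
    print(f"step {step:>4}: lr = {linear_lr(step, total_steps):.2e}")
```

With logging_steps = 1, each of these optimizer steps produces one logged point in W&B.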

The following code captures the GPU memory state before training starts.

# Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

With the setup complete, let's start training the model.

trainer_stats = trainer.train()
print(trainer_stats)
wandb.finish()

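After training finishes, the code below compares the final memory usage with the initial state: it reports the peak reserved memory, the portion attributable to training, and both as a percentage of total GPU memory.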

# Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

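In W&B we can visualize training and system metrics such as memory usage, training duration, and training loss to better understand how the run behaved over time.

Saving and Deploying the Model

After fine-tuning the vision-language model, we save the trained model locally and upload it to the Hugging Face Hub for easy access and later deployment. Use push_to_hub for online storage or save_pretrained for local saving.

Note that these calls save only the LoRA adapter, not the full merged model. With LoRA, only the adapter weights are trained while the base model stays frozen, so only the adapter weights are written when the model is saved.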

# Local saving
model.save_pretrained("<lora_model_name>") 
tokenizer.save_pretrained("<lora_model_name>")
# Online saving
model.push_to_hub("<hf_username/lora_model_name>", token = hf_token)
tokenizer.push_to_hub("<hf_username/lora_model_name>", token = hf_token)
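
One way to confirm that only an adapter was written is to list the files in the output directory and check their sizes. The helper below is a small standard-library sketch, with summarize_saved_dir as a made-up name. The demo runs on a stand-in temporary directory whose file names follow the adapter_config.json and adapter_model.safetensors convention that PEFT typically uses; the actual contents of your directory may differ.

```python
import tempfile
from pathlib import Path

def summarize_saved_dir(path):
    """Map each file in a saved-model directory to its size in bytes."""
    root = Path(path)
    return {p.name: p.stat().st_size for p in sorted(root.iterdir()) if p.is_file()}

# Demo on a stand-in directory; in practice, pass the directory given to save_pretrained.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "adapter_config.json").write_text('{"r": 8, "lora_alpha": 16}')
    (Path(tmp) / "adapter_model.safetensors").write_bytes(b"\0" * 1024)  # placeholder bytes
    demo_summary = summarize_saved_dir(tmp)

for name, size in demo_summary.items():
    print(f"{name}: {size} bytes")
```

An adapter-only save should be far smaller than the multi-gigabyte base model.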

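To merge the LoRA adapter into the original base model and save the result in 16-bit precision for serving with vLLM, use the merged_16bit option. This stores the fine-tuned model in float16 format.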

# Merge to 16bit
model.save_pretrained_merged("<model_name>", tokenizer, save_method = "merged_16bit",)
model.push_to_hub_merged("<hf_username/model_name>", tokenizer, save_method = "merged_16bit", token = hf_token)
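
Once the merged model is served, a client needs to send it an image together with a question. The sketch below only builds such a request body. It assumes the model is exposed through vLLM's OpenAI-compatible server, which accepts chat completion requests at /v1/chat/completions. The helper name build_vision_chat_request, the example question, and the placeholder image bytes are all illustrative, and the request is not actually sent.

```python
import base64
import json

def build_vision_chat_request(model_name, question, image_bytes,
                              mime_type="image/png", max_tokens=1024):
    """Build an OpenAI-style chat completion payload with one text part and one image part."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model_name,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime_type};base64,{image_b64}"},
                    },
                ],
            }
        ],
    }

# Placeholder inputs; in practice, read the bytes of a real image file.
payload = build_vision_chat_request(
    model_name="<hf_username/model_name>",
    question="Compute the area of the shaded shape.",
    image_bytes=b"placeholder image bytes",
)
print(json.dumps(payload, indent=2)[:400])
```

The payload can then be posted with any HTTP client to the server's chat completions endpoint. Check the vLLM documentation for the options supported by your version and model.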

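Model Evaluation

With LoRA fine-tuning complete, we now test the model by loading a sample image and its corresponding math problem statement from the dataset (data not used for training) and checking how the model interprets and answers it.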

from unsloth import FastVisionModel
model, tokenizer = FastVisionModel.from_pretrained(
    model_name = "<lora_model_name>",   # Trained model either locally or from huggingface
    load_in_4bit = False,
)
FastVisionModel.for_inference(model)         # Enable for inference!

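# Use a held-out sample: index 3000 lies just outside the 3,000 rows selected for training
eval_sample = load_dataset(dataset_id, subset, split="train")[3000]
image = eval_sample["images"][0]
instruction = eval_sample["texts"][0]["user"]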
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {
                "type": "text",
                "text": instruction
            },
        ]
    }
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 1024,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

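After fine-tuning on only about 3,000 samples, you will find that the model's performance improves markedly. The model not only interprets the shape correctly but also shows accurate mathematical and geometric reasoning, producing a concise and precise answer without the hallucinations seen earlier.

Conclusion

Multimodal and vision AI is transforming industries by enabling models to process visual and textual data together. Tools such as Unsloth make it simpler and more efficient to fine-tune these models for specific applications: Unsloth reduces training time and memory usage, while LoRA makes the fine-tuning parameter-efficient. Integration with Weights & Biases helps you track and analyze experiments effectively. Together, these tools give researchers and businesses a practical way to put multimodal AI to work in real applications.

Source: https://gautam75.medium.com/fine-tuning-vision-language-models-using-lora-b640c9af8b3c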
