Create Your Vision Chat Assistant with LLaVA

November 14, 2023 | by alex

Introduction

Large language models have proved themselves to be a revolutionary technology. Numerous applications exploiting their capabilities have already been developed, and many more are expected to come. One of the most interesting applications of large language models is their deployment as intelligent assistants capable of helping human users in a variety of tasks. Chat models trained with instruction tuning and Reinforcement Learning from Human Feedback (RLHF) have shown very promising capabilities of following human instructions and carrying out the assigned tasks. However, they are limited in their applicability to language-only tasks.


Multimodal conversational models aim to unleash the power of large language models to tackle problems that require combining natural language with other modalities to be solved. In particular, vision-language models have received increasing attention since the introduction of the vision capabilities of GPT-4V. Empowering the natural language capabilities of GPT-4 with image understanding has resulted in a powerful chat assistant that can help users with tasks requiring both vision and language understanding. While the vision capabilities of GPT-4V are impressive, closed-source models limit the possibilities for research and experimentation with this amazing technology. Fortunately, some open-source models have appeared, bringing the power of vision-language models to the community in an easily accessible and transparent way. These models also continue the trend of an increased focus on compute and memory efficiency, a trend already common for open-source large language models. This is an important feature because it facilitates the widespread adoption of these models.


In this post, I will describe the process of creating a vision chat assistant using the LLaVA (Large Language and Vision Assistant) model. I will first give a brief introduction to the LLaVA model and its improvements, before discussing a simple code implementation of a vision chat assistant using the code provided in the official repository. I will then present some examples I crafted to showcase the capabilities and limitations of the model.


LLaVA

The LLaVA model was introduced in the paper Visual Instruction Tuning, and then further improved in Improved Baselines with Visual Instruction Tuning (also referred to as LLaVA-1.5). The idea behind it is to extract visual embeddings from an image and treat them in the same way as embeddings coming from language tokens by feeding them to a large language model. Intuitively, we can think that the image will be described with "words" that the language model will use to generate its answer. To choose the right "words", the model uses a pre-trained CLIP visual encoder to extract the visual embeddings and then projects them into the word embedding space of the language model. The latter operation is accomplished with a vision-language connector, which was originally chosen to be a simple linear layer in the first paper Visual Instruction Tuning, and was later replaced with a more expressive Multilayer Perceptron (MLP) in Improved Baselines with Visual Instruction Tuning. The architecture of the model is depicted below.


[Figure: the LLaVA architecture — CLIP vision encoder features are projected by the vision-language connector into the input space of the language model]
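
To make the projection step concrete, here is a minimal PyTorch sketch of an LLaVA-1.5-style vision-language connector. This is not the official implementation; the dimensions (1024 for CLIP ViT-L/14 patch features, 4096 for Vicuna-7B token embeddings) are illustrative assumptions.

import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Sketch of an LLaVA-1.5-style connector: a two-layer MLP that maps
    CLIP patch features into the word embedding space of the LLM."""
    def __init__(self, vision_dim: int = 1024, lm_dim: int = 4096):
        super().__init__()
        # LLaVA-1.5 replaced the original single linear layer with an MLP.
        self.proj = nn.Sequential(nn.Linear(vision_dim, lm_dim),
                                  nn.GELU(),
                                  nn.Linear(lm_dim, lm_dim))

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, lm_dim):
        # each projected patch then acts as a pseudo "word" for the LLM.
        return self.proj(patch_features)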


One advantage of this approach is that, by using a pre-trained vision encoder and a pre-trained language model, only the vision-language connector (which is a lightweight module) must be learned from scratch. In particular, the training of LLaVA consists of two stages:


  • Pre-training for feature alignment: both the pre-trained vision encoder and the language model are frozen, and only the weights of the vision-language connector are updated. All training samples consist of text-image pairs packed into a single-turn conversation. This stage aims at training the vision-language connector to align the embeddings of the vision encoder with the text embeddings of the language model.


  • Fine-tuning with visual instructions: in this stage, only the weights of the vision encoder are frozen, while the vision-language connector and the language model are fine-tuned together. The model is fine-tuned on image-based instruction-following tasks. Interestingly, some of this data was created by using the language-only GPT-4 to generate instruction-following samples from the captions of the images and the bounding-box coordinates of the depicted entities. A minimal sketch of both freezing setups is shown right after this list.
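
As a rough illustration of the two stages, the sketch below shows how the freezing setups could look for the LlavaLlamaForCausalLM model used later in this post. The attribute names (mm_projector, get_vision_tower) follow the official repository, but treat this as an assumption rather than the official training code.

def freeze_for_feature_alignment(model) -> None:
    """Stage 1: train only the vision-language connector."""
    for p in model.parameters():
        p.requires_grad = False   # freeze the vision encoder and the LLM
    for p in model.get_model().mm_projector.parameters():
        p.requires_grad = True    # the lightweight connector is learned from scratch

def unfreeze_for_instruction_tuning(model) -> None:
    """Stage 2: fine-tune the connector and the language model together."""
    for p in model.parameters():
        p.requires_grad = True
    for p in model.get_vision_tower().parameters():
        p.requires_grad = False   # only the vision encoder stays frozen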


Implementation of the Vision Chatbot

Creating a vision chatbot is quite easy using the code provided in the official repository. The repository also provides standardized chat templates that can be used to parse the inputs in the right format. Following the right format used during training is essential for the quality of the answers generated by the model. The exact template depends on the language model used. The template for LLaVA-1.5 with a pre-trained Vicuna language model looks like this:


A chat between a curious user and an artificial intelligence assistant. The 
assistant gives helpful, detailed, and polite answers to the user's questions. 
USER: <im_start><image><im_end> User's prompt
ASSISTANT: Assistant answer
USER: Another prompt
The first lines are the general system prompt used by the model. The special tokens <im_start>, <image>, and <im_end> are used to indicate where the embeddings representing the image will be placed.
The chatbot can be defined in just one simple Python class (the imports below come from torch, transformers, and the official LLaVA repository):

import requests
from io import BytesIO

import torch
from PIL import Image
from transformers import AutoTokenizer, BitsAndBytesConfig

from llava.constants import (DEFAULT_IM_START_TOKEN, DEFAULT_IMAGE_TOKEN,
                             DEFAULT_IM_END_TOKEN, IMAGE_TOKEN_INDEX)
from llava.conversation import SeparatorStyle, conv_templates
from llava.mm_utils import KeywordsStoppingCriteria, tokenizer_image_token
from llava.model import LlavaLlamaForCausalLM
from llava.utils import disable_torch_init

class LLaVAChatBot:
    def __init__(self,
                 model_path: str = 'liuhaotian/llava-v1.5-7b',
                 device_map: str = 'auto',
                 load_in_8_bit: bool = True,
                 **quant_kwargs) -> None:
        self.model = None
        self.tokenizer = None
        self.image_processor = None
        self.conv = None
        self.conv_img = None
        self.img_tensor = None
        self.roles = None
        self.stop_key = None
        self.load_models(model_path,
                         device_map=device_map,
                         load_in_8_bit=load_in_8_bit,
                         **quant_kwargs)
    def load_models(self, model_path: str,
                    device_map: str,
                    load_in_8_bit: bool,
                    **quant_kwargs) -> None:
        """Load the model, processor and tokenizer."""
        quant_cfg = BitsAndBytesConfig(**quant_kwargs)
        self.model = LlavaLlamaForCausalLM.from_pretrained(model_path,
                                                           low_cpu_mem_usage=True,
                                                           device_map=device_map,
                                                           load_in_8bit=load_in_8_bit,
                                                           quantization_config=quant_cfg)
        self.tokenizer = AutoTokenizer.from_pretrained(model_path,
                                                       use_fast=False)
        vision_tower = self.model.get_vision_tower()
        vision_tower.load_model()
        vision_tower.to(device='cuda')
        self.image_processor = vision_tower.image_processor
        disable_torch_init()
    def setup_image(self, img_path: str) -> None:
        """Load and process the image."""
        if img_path.startswith('http') or img_path.startswith('https'):
            response = requests.get(img_path)
            self.conv_img = Image.open(BytesIO(response.content)).convert('RGB')
        else:
            self.conv_img = Image.open(img_path).convert('RGB')
        self.img_tensor = self.image_processor.preprocess(self.conv_img,
                                                          return_tensors='pt'
                                                          )['pixel_values'].half().cuda()
    def generate_answer(self, **kwargs) -> str:
        """Generate an answer from the current conversation."""
        raw_prompt = self.conv.get_prompt()
        input_ids = tokenizer_image_token(raw_prompt,
                                          self.tokenizer,
                                          IMAGE_TOKEN_INDEX,
                                          return_tensors='pt').unsqueeze(0).cuda()
        stopping = KeywordsStoppingCriteria([self.stop_key],
                                            self.tokenizer,
                                            input_ids)
        with torch.inference_mode():
            output_ids = self.model.generate(input_ids,
                                             images=self.img_tensor,
                                             stopping_criteria=[stopping],
                                             **kwargs)
        outputs = self.tokenizer.decode(
            output_ids[0, input_ids.shape[1]:]
        ).strip()
        self.conv.messages[-1][-1] = outputs
        return outputs.rsplit('</s>', 1)[0]
    def get_conv_text(self) -> str:
        """Return full conversation text."""
        return self.conv.get_prompt()
    def start_new_chat(self,
                       img_path: str,
                       prompt: str,
                       do_sample=True,
                       temperature=0.2,
                       max_new_tokens=1024,
                       use_cache=True,
                       **kwargs) -> str:
        """Start a new chat with a new image."""
        conv_mode = "v1"
        self.setup_image(img_path)
        self.conv = conv_templates[conv_mode].copy()
        self.roles = self.conv.roles
        first_input = (DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN +
                       DEFAULT_IM_END_TOKEN + '\n' + prompt)  # f"{self.roles[0]}: {prompt}")
        self.conv.append_message(self.roles[0], first_input)
        self.conv.append_message(self.roles[1], None)
        if self.conv.sep_style == SeparatorStyle.TWO:
            self.stop_key = self.conv.sep2
        else:
            self.stop_key = self.conv.sep
        answer = self.generate_answer(do_sample=do_sample,
                                      temperature=temperature,
                                      max_new_tokens=max_new_tokens,
                                      use_cache=use_cache,
                                      **kwargs)
        return answer
    def continue_chat(self,
                      prompt: str,
                      do_sample=True,
                      temperature=0.2,
                      max_new_tokens=1024,
                      use_cache=True,
                      **kwargs) -> str:
        """Continue the existing chat."""
        if self.conv is None:
            raise RuntimeError("No existing conversation found. Start a new"
                               "conversation using the `start_new_chat` method.")
        self.conv.append_message(self.roles[0], prompt)
        self.conv.append_message(self.roles[1], None)
        answer = self.generate_answer(do_sample=do_sample,
                                      temperature=temperature,
                                      max_new_tokens=max_new_tokens,
                                      use_cache=use_cache,
                                      **kwargs)
        return answer


If you are familiar with the transformers library, you will recognize many of the usual features, and the operations performed should be straightforward to understand. Let's go quickly over the methods of the LLaVAChatBot class defined above.


  • load_models: this method loads the language model, the tokenizer, and the image processor, with the specified parameters for quantization using the BitsAndBytes library. The code shadows the from_pretrained method used by Hugging Face transformers models. BitsAndBytes allows quantizing the model to 8 bits or 4 bits to reduce the GPU memory requirements.
  • setup_image: it loads the image from a local path or a URL and converts it to a tensor using the image processor.
  • generate_answer: this method returns the model's answer continuing the current conversation about the provided image. Again, the generate method of the LLaVA model is analogous to the generate method of Hugging Face transformers models.
  • get_conv_text: this method returns the raw text of the conversation so far.
  • start_new_chat: this is one of the two main methods of the chatbot, and it is used to start a new chat with the model. It creates a new conversation given the image and the initial prompt to pass to the model. It takes care of setting up the conversation using the templates defined in the repository, following the format discussed in the previous section.
  • continue_chat: the other main method, it continues an existing conversation about an image. A minimal usage sketch follows this list.
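
Putting it all together, a usage sketch could look like the following. It assumes a CUDA-capable GPU; the image path and prompts are placeholders.

# Hypothetical usage example for the class above.
chatbot = LLaVAChatBot(model_path='liuhaotian/llava-v1.5-7b',
                       load_in_8_bit=True)

# Start a chat about a local image or a URL.
answer = chatbot.start_new_chat(img_path='path/to/image.jpg',
                                prompt='Describe the image in details.')
print(answer)

# Ask a follow-up question about the same image.
follow_up = chatbot.continue_chat('What is unique about this subject?')
print(follow_up)

# Inspect the raw conversation text, chat template included.
print(chatbot.get_conv_text())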


Examples

I collected some interesting examples obtained while testing the capabilities of the model. All the examples in this post were created using the model llava-v1.5-7b with 8-bit quantization.


To start, let's look at the model's capabilities to describe and understand images. Below, I asked the model to describe an image picturing a white tiger.


[Image: a white tiger lying on a rocky surface]


A chat between a curious user and an artificial intelligence assistant. The 
assistant gives helpful, detailed, and polite answers to the user's questions. 
USER: <im_start><image><im_end>
Describe the image in details. 
ASSISTANT: The image features a white tiger laying down on a rocky surface, 
possibly a stone wall or a rock. The tiger appears to be relaxed and 
comfortable in its environment. The tiger's position is such that it covers a 
significant portion of the image, from the left side to almost the entire right
side. The rocky surface provides a natural and rugged backdrop for the scene.</s></s>
USER: What is unique about this tiger? 
ASSISTANT: This tiger is unique because it is a white tiger, which is an 
extremely rare and visually striking variation of the species. White tigers are
born with a genetic mutation that causes their fur to be predominantly white, 
with the rest of their body being a mix of orange and black stripes. This 
distinctive appearance sets them apart from the more common orange and black 
striped tigers.</s></s>


The model seems to be able to describe the image in detail and also to reason about the subject of the image, correctly assessing that a white tiger is a rare occurrence.


Next, I tested whether the model is able to provide the user with actionable instructions to perform various tasks. This is what distinguishes a "captioning" model from a true vision chat assistant. For example, can the model provide recipes from the ingredients present in an image? The result is reported below.


[Image: a table with ingredients — a large cut of beef, carrots, potatoes, onions, and a bowl of sauce]


A chat between a curious user and an artificial intelligence assistant. The 
assistant gives helpful, detailed, and polite answers to the user's questions. 
USER: <im_start><image><im_end>
What can I cook with these ingredients? 
ASSISTANT: With the ingredients displayed on the table, you can prepare a 
delicious roast dinner. The meat, which is a large piece of beef, can be 
seasoned and cooked in the oven, while the vegetables, including carrots, 
potatoes, and onions, can be roasted alongside the meat. You can also prepare a
side dish with the potatoes, carrots, and onions, which can be served alongside
the roast dinner. The presence of a bowl of sauce suggests that you can use it 
as a dipping sauce or to enhance the flavors of the roast and vegetables. 
Overall, this combination of ingredients offers a hearty and flavorful meal 
that can be enjoyed with family and friends.</s></s>
USER: Could the recipe you suggested be included in a vegetarian diet? 
ASSISTANT: The recipe I suggested, which involves roasting a large piece of 
beef and vegetables, is not suitable for a vegetarian diet. The beef is the 
main protein source in the dish, and it is not suitable for vegetarians who 
avoid consuming meat. However, you can still create a vegetarian-friendly meal 
using the vegetables and sauce available on the table. For example, you can 
prepare a vegetable roast with the carrots, potatoes, and onions, and serve it 
alongside a side of mashed potatoes or a vegetable-based sauce. This way, you 
can still enjoy the flavors and textures of the dish without including meat.</s></s>
USER: Can you suggest a vegetarian recipe with minimal modifications of the 
depicted ingredients? 
ASSISTANT: Certainly! Here's a vegetarian recipe that can be made using the 
ingredients available on the table:
Roasted Vegetable Medley with Sauce
Ingredients:
* Carrots
* Potatoes
* Onions
* Olive oil
* Salt
* Pepper
* Herbs (such as thyme, rosemary, or parsley)
* Sauce (optional)
Instructions:
1. Preheat your oven to 400°F (200°C).
2. Wash and peel the carrots, potatoes, and onions. Cut them into bite-sized 
pieces.
3. In a large bowl, toss the vegetables with olive oil, salt, and pepper.
4. Spread the vegetables evenly on a baking sheet lined with parchment paper.
5. Roast the vegetables in the preheated oven for 20-25 minutes, or until they 
are tender and slightly browned.
6. If desired, prepare a sauce by combining ingredients such as tomatoes, 
garlic, and herbs. You can use this sauce as a dipping sauce or to enhance the
flavors of the roasted vegetables.
7. Serve the roasted vegetable medley alongside a side of mashed potatoes or a 
vegetable-based sauce.
This vegetarian recipe allows you to enjoy the flavors and textures of the 
roasted vegetables without including meat. The sauce can be easily customized 
to suit your taste preferences, and the dish can be served alongside a variety 
of side dishes, such as mashed potatoes or a vegetable-based sauce.</s></s>


The model seems to have a good understanding of the ingredients depicted in the image, and it is also capable of connecting individual words to more abstract concepts (e.g., a meal with meat is not vegetarian).


Let's now see how the model can be used to get pieces of advice. This is a useful capability for getting new ideas when brainstorming, and it can provide real value to the user. In the conversation below, I asked the model for suggestions to change the look of a café.


[Image: the interior of a café with a bar and stools]


A chat between a curious user and an artificial intelligence assistant. The 
assistant gives helpful, detailed, and polite answers to the user's questions. 
USER: <im_start><image><im_end>
How can I renovate this café to make it look more vintage? 
ASSISTANT: To make the café look more vintage, you can consider the following 
changes:
1. Update the lighting: Replace the current lighting with vintage-style lamps 
or lanterns, which will create a warm and cozy atmosphere.
2. Add vintage decorations: Incorporate antique items, such as vintage clocks, 
artwork, or decorative plates, to create a nostalgic ambiance.
3. Use vintage furniture: Replace the current stools with vintage-style chairs 
or barstools, which can add character and charm to the space.
4. Incorporate greenery: Add more greenery, such as potted plants or vines, to 
create a lush and inviting atmosphere.
5. Update the bar: Replace the current bar with a vintage-style bar, complete 
with a vintage-inspired backsplash and countertop.
6. Use vintage-style glassware: Replace the current glassware with vintage-style
glasses, such as wine glasses, beer mugs, or cocktail shakers.
7. Add vintage-style signage: Replace the current signage with vintage-style 
menu boards or chalkboards, which can add a touch of nostalgia to the space.
8. Use vintage-style wallpaper or paint: Apply a vintage-style wallpaper or 
paint to the walls, which can create a unique and charming atmosphere.
By implementing these changes, the café will appear more vintage and inviting, 
creating a memorable experience for its patrons.</s></s>


The suggestions look on point, although quite generic, identifying the current elements that could be changed to give a more vintage style.


Let's continue with something practical that can help users in their daily activities. Can the model write advertisements or listings?


[Image: a chair with a black seat and a wooden base]


A chat between a curious user and an artificial intelligence assistant. The 
assistant gives helpful, detailed, and polite answers to the user's questions. 
USER: <im_start><image><im_end>
Write an advertisment for this chair 
ASSISTANT: Introducing the modern and stylish black and wood chair, perfect for
any contemporary setting. This chair features a sleek black seat and a wooden 
base, creating a striking contrast against any background. Its minimalist 
design makes it a versatile addition to any room, whether it's used as a dining
chair, a side table, or a statement piece in a living space. The chair's sturdy
construction ensures it can withstand daily use, while its clean lines and 
sleek appearance add a touch of sophistication to any room. Upgrade your living
space with this stylish and functional chair today.</s></s>
USER: Write now an advertisement for the same chair but targeting cafes 
ASSISTANT: Attention café owners and design enthusiasts! Introducing the black
and wood chair, the perfect addition to your café's seating area. This stylish
and functional chair offers a modern touch to your establishment, with its 
sleek black seat and wooden base creating a striking contrast against any 
background. Its minimalist design makes it a versatile choice for any café 
setting, whether it's used as a dining chair, a side table, or a statement 
piece in a lounge area. The chair's sturdy construction ensures it can 
withstand daily use, while its clean lines and sleek appearance add a touch of 
sophistication to any room. Upgrade your café's ambiance with this stylish and 
functional chair today.</s></s>


The model was able to correctly pinpoint the features of the chair and created an appealing advertisement, correctly customizing it for different targets when asked to.


As a last test, let's see if prompting techniques that have proven beneficial for large language models can be used with LLaVA as well. Below is an example of attempting zero-shot Chain-of-Thought prompting.


[Image: a large white fish sculpture lying on a brick walkway near a body of water]


A chat between a curious user and an artificial intelligence assistant. The 
assistant gives helpful, detailed, and polite answers to the user's questions. 
USER: <im_start><image><im_end>
Would the fish depicted in the image float if thrown in water? Think step by 
step. 
ASSISTANT: The fish depicted in the image is a large white fish sculpture or 
statue, which is lying on the ground on a brick walkway near a body of water. 
If thrown into the water, the fish would not float because it is a solid, 
inanimate object. It would sink to the bottom of the water, as it is not 
buoyant due to its composition and size. The fish's position on the ground and 
its lack of buoyancy make it impossible for it to float in water.</s></s>
USER: What material could be used to make a fish sculpture that floats on 
water? 
ASSISTANT: To make a fish sculpture that floats on water, one could use a 
lightweight, buoyant material such as expanded polystyrene foam or polyethylene
foam. These materials are commonly used for making buoyant objects like life 
jackets, flotation devices, and fishing floats. By shaping the foam into the 
desired fish form, it would be able to float on the water's surface, 
simulating the behavior of a real fish.</s></s>


The model was able to explain why the fish would not float, correctly identifying that it is not a real fish. It also followed the instruction to think step by step before giving the final answer. For comparison, here is an example of the same conversation without the "Think step by step" prompt.


A chat between a curious user and an artificial intelligence assistant. The 
assistant gives helpful, detailed, and polite answers to the user's questions.
USER: <im_start><image><im_end>
Would the fish depicted in the image float if thrown in water? 
ASSISTANT: No, the fish depicted in the image would not float if thrown in 
water. It is a large, white, stone fish sculpture, which is not a real fish 
and does not have the buoyancy properties of a natural fish.</s></s>


While the final answer is still correct, the explanation given in this case is much less detailed and the answer is given at the start of the sentence. Since the generative language model used by LLaVA is causal, this means the answer provided does not rely on the subsequent explanation. It would be interesting to carry out more extensive experiments to test whether Chain-of-Thought (CoT) can, in general, improve the performance of vision-language models on complex reasoning tasks, similar to what has been observed for language models in Chain-of-Thought Prompting Elicits Reasoning in Large Language Models and Large Language Models are Zero-Shot Reasoners.


Limitations

While very powerful, LLaVA comes with its limitations. For instance, the model has been trained to use only one image per chat, so it isn't able to handle more complex conversations that require interacting with multiple images. It is worth noticing, however, that this limitation is not intrinsic to the architecture of the model and can be amended by including multi-image conversational data during training.


During my tests, I noticed that the model (at least the 8-bit quantized version of the 1.5 7b model) has trouble counting elements in an image and, in general, with handling numerical tasks. Below is an example.


[Image: a parking lot with numbered parking spots]


A chat between a curious user and an artificial intelligence assistant. The 
assistant gives helpful, detailed, and polite answers to the user's questions. 
USER: <im_start><image><im_end>
How many car are inside the parking lot? 
ASSISTANT: There are several cars parked inside the parking lot.</s></s>
USER: List the numbers of the parking spots occupied by cars 
ASSISTANT: The parking spots occupied by cars are 66, 67, 68, 69, 70, and 71.</s></s>


The model, like all large language models, is also prone to hallucinations and can easily be tricked into making mistakes with appropriate prompts. For example, when given a picture of a desert and asked what species the fish in the image is, it hallucinates a desert fish.


[Image: a desert landscape]


A chat between a curious user and an artificial intelligence assistant. The 
assistant gives helpful, detailed, and polite answers to the user's questions. 
USER: <im_start><image><im_end>
Tell me the species of the fish in this image 
ASSISTANT: The fish in this image is a desert fish, which is a species adapted 
to living in desert environments.</s></s>


Conclusion

LLaVA shows impressive capabilities in vision-language understanding, and it constitutes a clear step forward for open-source multimodal vision-language models. One of LLaVA's biggest advantages is that it is lightweight to train and fine-tune. For instance, the full training of LLaVA 1.5 13b took only 1.2M data points and roughly 1 day on a single 8-A100 node. This makes it suitable for fine-tuning on specific domains to get an expert assistant, as was done, for example, in LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day.


Adding vision capabilities to chat assistants expands the area of application of such models, bringing their revolutionary potential to more complex and nuanced tasks. Treating image features as language tokens also brings up the possibility of using all the advanced prompting techniques developed for text-only language models, and of expanding them further. For example, one could extend the power of Retrieval Augmented Generation by retrieving both texts and images that are relevant to the conversation. In fact, using the shared image-text embedding space of CLIP, it is possible to retrieve both external documents and external images starting from either an input text or an input picture! A minimal retrieval sketch is shown below.
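
As a rough illustration of this idea, here is a minimal sketch of cross-modal retrieval with CLIP's shared embedding space, using the transformers library. The model name is real; the image corpus and query are placeholders.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained('openai/clip-vit-large-patch14')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-large-patch14')

# Hypothetical image corpus to search over.
paths = ['cafe.jpg', 'tiger.jpg']
corpus = [Image.open(p).convert('RGB') for p in paths]
query = 'a white tiger resting on rocks'

inputs = processor(text=[query], images=corpus,
                   return_tensors='pt', padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Similarity of the text query to every image in the corpus; the same
# logits could rank text documents against an image query instead.
best = outputs.logits_per_text.argmax().item()
print(f'Most relevant image: {paths[best]}')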


Another interesting direction to expand the capabilities of the model is presented in LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing. The main idea is to combine the various capabilities of vision-language chat models, text-to-image generative models, and other vision models (such as image segmentation models) to obtain an assistant capable of handling multimodal inputs and producing multimodal outputs.


Overall, LLaVA marks an important step for open-source multimodal generative models, which have shown impressive capabilities and are attracting a lot of interest. As open-source models become more widely adopted, I believe we will soon witness a rapid increase in new applications of these powerful models.


Source: https://medium.com/towards-data-science/create-your-vision-chat-assistant-with-llava-610b02c3283e