A Simple Framework for RAG-Enhanced Visual Question Answering

Published by alex on September 2, 2024

Introduction

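Retrieval-augmented generation (RAG) is a powerful technique that can improve the accuracy and reliability of the answers generated by large language models (LLMs). It also makes it possible to check the sources the model used during a particular generation, which makes fact-checking easier for human users. Furthermore, RAG keeps the model's knowledge up to date and incorporates topic-specific information without the need for fine-tuning. Overall, RAG offers many benefits and few drawbacks, and its workflow is straightforward to implement. For this reason, it has become the go-to solution for many LLM use cases that require up-to-date and/or specialized knowledge.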

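Some of the latest developments in generative AI have focused on extending the popular transformer architecture to handle multiple input and/or output modalities, in an attempt to replicate the huge success of LLMs. Several open-source and closed-source models already demonstrate a remarkable ability to handle multiple modalities. Open-source contributions to vision language models (VLMs) have been particularly notable, with the release of small yet powerful models such as LLaVA, Idefics, and Phi-vision.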

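Designing RAG systems for multimodal models is more challenging than for text-only models. The design of RAG systems for LLMs is well established, and there is some consensus on the general workflow, since many recent developments focus on improving accuracy, reliability, and scalability rather than fundamentally changing the RAG architecture. Multimodality, on the other hand, opens up multiple ways of retrieving relevant information, so several different architectural choices are possible, each with its own advantages and drawbacks. For example, one can use a multimodal embedding model to create a shared vector space for the different modalities, or instead choose to ground the information in a single modality only.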

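In this post, I will discuss a simple framework for extending RAG to vision language models (VLMs), focusing on the visual question answering task. The core idea of the method is to exploit the VLM's ability to understand both text and images in order to generate a suitable search query, which is used to retrieve external information before answering the user's prompt.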

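I will also provide a practical tutorial on how to implement the framework to give Phi-3.5-vision access to Wikipedia information, discuss the main points of the implementation, and show some examples.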

RAG for Visual Question Answering

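In this section, I will describe the general workflow of the framework mentioned in the introduction. For the sake of exposition, I will discuss the case of a single user prompt about a single image. This is the case, for example, for simple visual question answering (VQA) tasks. The method can be generalized straightforwardly to multiple prompts and images, but the pipeline becomes more complex and introduces further complications. Furthermore, I will only consider the case in which the external data consists solely of text documents. With a multimodal embedding model for retrieval or, more generally, a multimodal search engine, the external data could also include images.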

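As in the usual RAG workflow, the framework's workflow can be divided into two parts: retrieval of the relevant external information, and generation conditioned on the provided external data.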

During the retrieval stage, the goal is to retrieve passages from the external text documents that provide useful information for answering the user's prompt. To do so effectively, we must ensure that the retrieved passages are relevant to the provided image, to the prompt and, more importantly, to the relationship between the two. Indeed, even if the retrieved documents contain information about the image, they may not contain the specific information needed to answer the user's prompt.

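In more detail, the multimodal model receives both the user's prompt and the image as input and is tasked with creating a search query that is relevant to the two taken together. This process can be seen as a special case of query transformation, designed to account for the multimodal nature of the problem: the model translates the user's prompt into a search query while also taking into account the image it refers to.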

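Compared with other approaches that treat each input modality separately (for example, using a multimodal embedding model for retrieval, or using a generated image caption/description for semantic similarity), the advantage of this approach is that it can capture the relationship between the prompt and the image more effectively.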

The flow chart of the retrieval stage is shown below.

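The generation stage is very similar to the standard text-only RAG workflow; the only difference is that, besides the prompt and the retrieved passages, the model also receives the image in its context. The process is illustrated in the figure below.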
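As a rough illustration of the two stages described above, the workflow can be written as one function that receives its three components as plain callables. The names Passage, answer_with_rag, generate_query, search, and generate_answer are hypothetical placeholders introduced for this sketch; they do not come from any library or from the implementation discussed later.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    source: str  # name of the document the text was taken from
    text: str    # the retrieved passage itself


def answer_with_rag(
    image: object,
    prompt: str,
    generate_query: Callable[[object, str], str],
    search: Callable[[str], List[Passage]],
    generate_answer: Callable[[object, str, List[Passage]], str],
) -> str:
    """Run the retrieval stage, then the generation stage."""
    # Retrieval: the VLM turns (image, prompt) into a text search query,
    # and a text-only search engine retrieves passages for that query.
    query = generate_query(image, prompt)
    passages = search(query)
    # Generation: the VLM sees the image, the prompt, and the passages.
    return generate_answer(image, prompt, passages)
```

Passing the components as arguments keeps the sketch independent of any particular model or search engine: the same function would work with a keyword search, a vector store, or a combination of the two.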

Enhancing Phi-3.5-vision with Wikipedia

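In this section, I will provide a practical guide on how to apply the framework discussed above to enhance a multimodal model by giving it access to Wikipedia. I chose Phi-3.5-vision because it is a very capable yet lightweight open-source vision language model.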

Retrieval

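The goal of the retrieval stage is to collect from Wikipedia some passages that can provide useful information for answering the user's question about an image. In the code implementation, I used the Python package wikipedia to search and retrieve content from Wikipedia.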

Here are the steps implemented to retrieve the relevant passages:

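1. Use the multimodal model to generate keywords that capture the meaning of the question about the image.
2. Use the generated keywords to search for relevant pages on Wikipedia.
3. Split the content of each retrieved page into chunks.
4. Select the top chunks by semantic textual similarity to the question and the keywords.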

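The first step uses Phi-3.5-vision to generate an appropriate search query that will be used to retrieve relevant Wikipedia pages. To do so, I ask Phi-3.5-vision to produce keywords relevant to the user's question and the image. I then use the built-in search function of the wikipedia package to retrieve some pages relevant to the generated keywords.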
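The page lookup can be sketched as follows. This is a minimal sketch, not the original code: wikipedia.search and wikipedia.page are functions of the wikipedia package, while fetch_pages, its parameters, and the default of three pages per keyword are assumptions made for illustration. The search and content functions can be injected, so the logic can be tried without network access.

```python
from typing import Callable, Dict, List, Optional


def fetch_pages(
    keywords: List[str],
    pages_per_keyword: int = 3,  # assumed value, not taken from the post
    search_fn: Optional[Callable[[str, int], List[str]]] = None,
    content_fn: Optional[Callable[[str], str]] = None,
) -> Dict[str, str]:
    """Return {page title: page text} for pages matching the keywords."""
    if search_fn is None or content_fn is None:
        import wikipedia  # third-party package: pip install wikipedia

        search_fn = search_fn or (lambda q, n: wikipedia.search(q, results=n))
        content_fn = content_fn or (
            lambda title: wikipedia.page(title, auto_suggest=False).content
        )
    pages: Dict[str, str] = {}
    for keyword in keywords:
        for title in search_fn(keyword, pages_per_keyword):
            if title in pages:
                continue  # the same page can match several keywords
            try:
                pages[title] = content_fn(title)
            except Exception:
                # For example disambiguation or missing pages;
                # this sketch simply skips them.
                continue
    return pages
```

With the keywords from the first example later in this post, a call would look like fetch_pages(["tomato plant", "tomato plant height"]). The titles returned depend on the live Wikipedia search index.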

The general single-turn, single-image chat template for Phi-3.5-vision has the following structure:

<|user|>\n
<|image_1|>\n
{prompt}<|end|>\n
<|assistant|>\n

To generate the keywords, I used the following prompt:

Your task is to write a few search keywords to find Wikipedia pages containing
the relevant information to answer the question about the provided image. The 
keywords must be as specific as possible and must represent the information 
that is needed to answer the question in relation to the provided image. Don't 
write more than 3 search keywords.
Question: {question}

Before inference, the {question} tag is replaced by the user's question.
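Putting the chat template and the keyword prompt together might look like the sketch below. The constant and function names are hypothetical, the display line breaks of the prompt are collapsed into single spaces, and the parser assumes that the model returns its keywords separated by commas or newlines, which is not guaranteed.

```python
import re
from typing import List

CHAT_TEMPLATE = "<|user|>\n<|image_1|>\n{prompt}<|end|>\n<|assistant|>\n"

KEYWORD_PROMPT = (
    "Your task is to write a few search keywords to find Wikipedia pages "
    "containing the relevant information to answer the question about the "
    "provided image. The keywords must be as specific as possible and must "
    "represent the information that is needed to answer the question in "
    "relation to the provided image. Don't write more than 3 search keywords.\n"
    "Question: {question}"
)


def build_keyword_chat_prompt(question: str) -> str:
    """Fill in the question, then wrap the prompt in the chat template."""
    # str.replace avoids surprises if the question itself contains braces.
    prompt = KEYWORD_PROMPT.replace("{question}", question)
    return CHAT_TEMPLATE.replace("{prompt}", prompt)


def parse_keywords(model_output: str, max_keywords: int = 3) -> List[str]:
    """Split the raw model output into at most max_keywords keywords."""
    # Assumption: keywords are separated by commas or newlines.
    parts = re.split(r"[,\n]", model_output)
    keywords = [part.strip(" \t-*\"'") for part in parts]
    return [keyword for keyword in keywords if keyword][:max_keywords]
```

If the model answers with a full sentence instead of a keyword list, the parser returns sentence fragments, so the subsequent search may simply return nothing.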

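After the keywords have been generated, the built-in search function of the wikipedia package is used to retrieve pages relevant to the generated keywords. Finally, the selected pages are split into passages, and the most relevant passages are selected using an embedding model and the LangChain implementation of the FAISS vector store. I used the embedding model snowflake-arctic-embed-l to embed the concatenation of the question and the keywords, as well as the chunks of the retrieved pages.

In practice, the retrieval stage is effectively a form of "hybrid search" consisting of two sequential steps: keyword search using the built-in search function of the wikipedia package, and embedding similarity retrieval using an embedding model. In this way, retrieval operates on the smaller space of passages from the most relevant pages selected by keyword search, avoiding the need to build an enormous vector store containing embeddings of all of Wikipedia's content. In other settings, the retrieval stage could be changed to use similarity search over the whole external corpus, or to use a different combination of retrieval methods.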
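The chunk-and-rank step can be illustrated with a dependency-free sketch. Note the substitution: instead of snowflake-arctic-embed-l and the FAISS vector store, it uses a bag-of-words cosine similarity as a stand-in, so the ranking quality is not comparable. The function names, the chunk size of 300 characters, and the overlap of 100 characters are assumptions made for illustration.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple


def split_into_chunks(text: str, chunk_size: int = 300, overlap: int = 100) -> List[str]:
    """Split text into overlapping character windows."""
    if not text:
        return []
    step = chunk_size - overlap
    stop = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


def bag_of_words(text: str) -> Counter:
    # Stand-in for a real embedding model: lowercase token counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_passages(
    question: str,
    keywords: List[str],
    pages: Dict[str, str],
    top_k: int = 3,
    embed: Callable[[str], Counter] = bag_of_words,
    similarity: Callable[[Counter, Counter], float] = cosine,
) -> List[Tuple[str, str]]:
    """Return the top_k (page title, chunk) pairs most similar to the query."""
    # The query is the concatenation of the question and the keywords.
    query_vec = embed(question + " " + ", ".join(keywords))
    scored = []
    for title, content in pages.items():
        for chunk in split_into_chunks(content):
            scored.append((similarity(query_vec, embed(chunk)), title, chunk))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(title, chunk) for _, title, chunk in scored[:top_k]]
```

Because only the chunks of the pages returned by the keyword search are scored, no index over the whole of Wikipedia is needed, which is the point of the two-step design described above.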

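Retrieving passages from multiple pages helps reduce the chance of selecting the wrong page, and it is also useful when information from more than one page is needed to produce an answer.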

Generation

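In the generation stage, the user's question, the retrieved passages, and the original image are used as inputs for Phi-3.5-vision to generate an answer.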

I used the following prompt within the general chat template for Phi-3.5-vision:

You are a helpful assistant tasked with answering questions about the provided 
image.
Answer the following question: {question}
You can use the following passages retrieved from Wikipedia to provide your 
answer:
{passages}

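At generation time, the {question} tag is replaced by the user's question as before, while the {passages} tag is replaced by the retrieved passages and the names of the corresponding pages, in the following format: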

From Wikipedia page {page_name} : "{passage1}"\n\n
From Wikipedia page {page_name} : "{passage2}"\n\n
From Wikipedia page {page_name} : "{passage3}"\n\n

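Providing the name of the page from which each passage was extracted helps resolve ambiguities when the content of the passage alone is not enough to uniquely determine the subject or topic it refers to.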
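The substitution of the two tags can be sketched as follows. The function and constant names are hypothetical, and the display line breaks of the prompt are collapsed; the passage format follows the one shown above.

```python
from typing import List, Tuple

GENERATION_PROMPT = (
    "You are a helpful assistant tasked with answering questions about the "
    "provided image.\n"
    "Answer the following question: {question}\n"
    "You can use the following passages retrieved from Wikipedia to provide "
    "your answer:\n"
    "{passages}"
)


def format_passages(passages: List[Tuple[str, str]]) -> str:
    """Render (page name, passage) pairs, one block per passage."""
    return "".join(
        f'From Wikipedia page {page_name} : "{passage}"\n\n'
        for page_name, passage in passages
    )


def build_generation_prompt(question: str, passages: List[Tuple[str, str]]) -> str:
    """Fill the {question} and {passages} tags of the generation prompt."""
    prompt = GENERATION_PROMPT.replace("{question}", question)
    return prompt.replace("{passages}", format_passages(passages))
```

The resulting string would then take the place of {prompt} in the chat template before being passed to the model together with the image.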

Examples

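In this section, I will show some examples of answers obtained with the implementation discussed in the previous section, comparing the outputs of the RAG-enhanced vision language model with those of the base version.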

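For each example below, I will show the image provided to the model, a block with the question and the answers of both the RAG-enhanced and base VLM, a block with the search query created by the model, and a block with the passages retrieved from Wikipedia.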

Example 1

Question: How tall are the plants that produce this fruit?
Base VLM: Tomatoes are typically grown on plants that can reach heights of 2 to
4 feet, depending on the variety and growing conditions.
RAG VLM: The tomato plants that produce this fruit typically grow to 1-3 meters
(3-10 feet) in height.

The VLM generated the following search keywords:

Search keywords: tomato plant, tomato plant height

And the following passages were retrieved:

From page Tomato (https://en.wikipedia.org/wiki/Tomato): 
 "throughout all seasons of the year. Tomato plants typically grow to 1–3 
meters (3–10 ft) in height. They are vines that have a weak stem that sprawls 
and typically needs support. Indeterminate tomato plants are perennials in 
their native habitat, but are cultivated as annuals. (Determinate, or".
From page Tomato (https://en.wikipedia.org/wiki/Tomato): 
 "Numerous varieties of the tomato plant are widely grown in temperate climates
 across the world, with greenhouses allowing for the production of tomatoes 
throughout all seasons of the year. Tomato plants typically grow to 1–3 meters
(3–10 ft) in height. They are vines that have a weak stem that".
From page Early Girl (https://en.wikipedia.org/wiki/Early_Girl): 
 "== See also ==
List of tomato cultivars

== References ==".
From page Black Krim (https://en.wikipedia.org/wiki/Black_Krim): 
 "== References ==

== See also ==
List of tomato cultivars
Heirloom tomato".
From page Celebrity tomato (https://en.wikipedia.org/wiki/Celebrity_tomato): 
 "== See also ==
List of tomato cultivars
Tomato production in Florida

== References ==".
From page Fourth of July tomato 
(https://en.wikipedia.org/wiki/Fourth_of_July_tomato): 
 "== See also ==
List of tomato cultivars

== References ==

== External links ==
Picture Gallery from Google Images".
From page Mortgage Lifter (https://en.wikipedia.org/wiki/Mortgage_Lifter): 
 "== See also ==
List of tomato cultivars

== References ==

== External links ==
Picture Gallery from Google Images".

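The RAG-enhanced model correctly reported the height range stated on the Wikipedia page "Tomato" (1 to 3 meters), whereas the base model answered with a lower range (2 to 4 feet).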

Example 2

Query: How much does this species weight?
Base VLM: Polar bears can weigh between 900 to 1,600 pounds 
(408 to 727 kilograms).
RAG VLM:  The weight of this species, the polar bear, ranges from 300-800 kg 
(660-1,760 lb) for males and 150-300 kg (330-660 lb) for females.

Search keywords: polar bear, weight

From page Polar bear (https://en.wikipedia.org/wiki/Polar_bear): 
"Males are generally 200–250 cm (6.6–8.2 ft) long with a weight of 300–800 kg
(660–1,760 lb). Females are smaller at 180–200 cm (5.9–6.6 ft) with a weight 
of 150–300 kg (330–660 lb). Sexual dimorphism in the species is particularly 
high compared with most other mammals. Male polar bears also have".
From page Polar bear (https://en.wikipedia.org/wiki/Polar_bear): 
 "== Notes ==

== References ==

== Bibliography ==

== External links ==
Polar Bears International website
ARKive—images and movies of the polar bear (Ursus maritimus)".
From page Polar bear (https://en.wikipedia.org/wiki/Polar_bear): 
 "weight of 150–300 kg (330–660 lb). Sexual dimorphism in the species is 
particularly high compared with most other mammals. Male polar bears also have 
proportionally larger heads than females. The weight of polar bears fluctuates 
during the year, as they can bulk up on fat and increase their mass by".
From page List of ursids (https://en.wikipedia.org/wiki/List_of_ursids): 
 "long, plus a 3–20 cm (1–8 in) tail, though the polar bear is 2.2–2.44 m 
(7–8 ft) long, and some subspecies of brown bear can be up to 2.8 m (9 ft). 
Weights range greatly from the sun bear, which can be as low as 35 kg (77 lb), 
to the polar bear, which can be as high as 726 kg (1,600 lb). Population".
From page Knut (polar bear) (https://en.wikipedia.org/wiki/Knut_(polar_bear)): 
 "== See also ==
Binky (polar bear)
List of individual bears

== References ==

== External links ==".
From page Polar bear (https://en.wikipedia.org/wiki/Polar_bear): 
 "The polar bear is the largest living species of bear and land carnivore, 
though some brown bear subspecies like the Kodiak bear can rival it in size. 
Males are generally 200–250 cm (6.6–8.2 ft) long with a weight of 300–800 kg 
(660–1,760 lb). Females are smaller at 180–200 cm (5.9–6.6 ft) with a".
From page Subspecies of brown bear 
(https://en.wikipedia.org/wiki/Subspecies_of_brown_bear): 
 "males, potentially tripling their average weight within three years' time, 
and can expect to average between 360 and 545 kg (794 and 1,202 lb). 
The reported mean adult body masses for both sexes of the polar bear are very 
similar to the peninsular giant and Kodiak bears. Due to their roughly".

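Again, the RAG model answered correctly with the weights reported on the Wikipedia page "Polar bear", while the base model gave a slightly different weight range.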

Example 3

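Let's see what happens in this example when using Phi-3.5-vision with and without the RAG enhancement. In particular, the model might now be more prone to hallucination if any misleading passages are retrieved.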

Question: Tell me the species of the fish in this image
Base VLM: The image does not contain any fish or aquatic environments, so 
I cannot provide a species of fish.
RAG VLM:  The image does not contain any fish or aquatic environments. It is
a photograph of a desert landscape with sand dunes. 
Therefore, I cannot provide the species of fish based on this image.

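Neither the base model nor the RAG model was induced to hallucinate by the misleading prompt, suggesting that Phi-3.5-vision may be more robust than LLaVA-1.5. In particular, the RAG VLM did not retrieve any page from Wikipedia. Indeed, when asked to generate keywords, the model answered: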

There are no fish visible in the image provided. The image shows a desert 
landscape with sand dunes. Therefore, the search keywords related to the image 
content would be more appropriate for a desert or dune-related topic.

This search query did not produce any results with the search function of the wikipedia package.

Limitations

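The main limitation of the framework is that it requires the model to have at least a general understanding of the provided image and question (while the exact details will be retrieved with RAG) in order to produce a useful search query. In some domain-specific settings, it may be necessary to fine-tune the model used to generate the search query.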

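The framework discussed in this post also suffers from the limitations common to traditional text-only RAG systems. A common source of generation errors is the incorrect retrieval of external information. In some cases, the model fails to recognize that the retrieval stage has failed and produces an answer based on incorrect, irrelevant, or misleading information. In particular, the presence of irrelevant passages can sometimes induce hallucinations that the model would not otherwise have produced.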

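Another common source of errors is ignoring the retrieved passages or quoting them incorrectly. This can sometimes lead to subtle discrepancies between the retrieved passages and the generated answer that may be hard to spot at first glance.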

Example

In this example, I will show a case in which the model misquotes the retrieved passages.

Query: What's the average weight of this animal?
Base VLM: The average male lion can weigh between 160 to 310 kilograms, while 
females typically weigh between 120 to 180 kilograms.
RAG VLM:  Based on the information provided in the passages, the average weight
of a male lion is 187.5 kg (386 lb) in Southern Africa and 174.9 kg (386 lb) in
East Africa.

Search keywords: lion, weight, average weight

From page American lion (https://en.wikipedia.org/wiki/American_lion): 
 "(518lbs-1153lbs) in males and 175kg to 365 kg (385lbs-805lbs) for females. 
In 2008, the American lion was estimated to weigh up to 420 kg (930 lb). A 
study in 2009 showed an average weight of 256 kg (564 lb) for males and 351 kg
(774 lb) for the largest specimen analyzed.".
From page Lion (https://en.wikipedia.org/wiki/Lion): 
 "=== Size ===
Among felids, the lion is second only to the tiger in size. The size and weight
of adult lions vary across its range and habitats. Accounts of a few 
individuals that were larger than average exist from Africa and India.

=== Mane ===".
From page Panthera leo leo (https://en.wikipedia.org/wiki/Panthera_leo_leo): 
 "The lion's fur varies in colour from light buff to dark brown. It has rounded
ears and a black tail tuft. Average head-to-body length of male lions is 
2.47–2.84 m (8 ft 1 in – 9 ft 4 in) with a weight of 148.2–190.9 kg 
(327–421 lb). Females are smaller and less heavy. Zoological lion specimens".
From page Panthera leo melanochaita 
(https://en.wikipedia.org/wiki/Panthera_leo_melanochaita): 
 "Average head-to-body length of male lions is 2.47–2.84 m (8 ft 1 in – 9 ft 
4 in) with a weight ranging from 150–225 kg (331–496 lb) averaging 187.5 kg 
(413 lb) in Southern Africa and 145.4–204.7 kg (321–451 lb) averaging 174.9 kg
 (386 lb) in East Africa. Females average 83–165 kg (183–364 lb) in".
From page Asiatic lion (https://en.wikipedia.org/wiki/Asiatic_lion): 
 "An adult male Asiatic lion weighs 160.1 kg (353 lb) on average with the 
limit being 190 kg (420 lb); a wild female weighs 100 to 130 kg (220 to 285 lb)
.[1]".
From page List of largest mammals 
(https://en.wikipedia.org/wiki/List_of_largest_mammals): 
 "== See also ==
List of largest land carnivorans
Largest organisms
Largest prehistoric animals
List of largest birds
List of largest cats
List of largest fish
List of largest plants
List of largest reptiles
List of largest insects
List of heaviest land mammals
Smallest organisms

== Notes ==".
From page Ancient Mesopotamian units of measurement 
(https://en.wikipedia.org/wiki/Ancient_Mesopotamian_units_of_measurement): 
 "== See also ==
Assyrian lion weights
Babylonian mathematics
Historical weights and measures
Weights and measures

== References ==

=== Citations ===".

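Although the answer expressed in kilograms is correct, the model reports the wrong value in pounds for the average weight of male lions in Southern Africa: it states 386 lb, whereas the corresponding passage extracted from Wikipedia gives the correct figure of 413 lb.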
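The discrepancy is easy to verify by hand: 187.5 kg × 2.20462 lb/kg ≈ 413 lb, while 174.9 kg × 2.20462 lb/kg ≈ 386 lb, so the answer appears to repeat the East Africa pound figure for Southern Africa. A small checker for this kind of error could look like the sketch below. It is not part of the original implementation; the function name and the tolerance of 1 lb are assumptions.

```python
import re
from typing import List, Tuple

LB_PER_KG = 2.20462  # pounds per kilogram

# Matches pairs written as "<number> kg (<number> lb)".
PAIR_PATTERN = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*kg\s*\((\d[\d,]*(?:\.\d+)?)\s*lb\)"
)


def find_unit_mismatches(
    text: str, tolerance_lb: float = 1.0
) -> List[Tuple[float, float, float]]:
    """Return (kg, stated_lb, expected_lb) for each inconsistent pair."""
    mismatches = []
    for kg_str, lb_str in PAIR_PATTERN.findall(text):
        kg = float(kg_str.replace(",", ""))
        stated_lb = float(lb_str.replace(",", ""))
        expected_lb = kg * LB_PER_KG
        if abs(stated_lb - expected_lb) > tolerance_lb:
            mismatches.append((kg, stated_lb, expected_lb))
    return mismatches
```

Applied to the RAG answer above, the check flags only the pair 187.5 kg (386 lb). Values that Wikipedia rounds to fewer significant figures, such as 190 kg (420 lb), would also be flagged with this tolerance, so the threshold would need tuning in practice.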

Conclusion

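In this post, I presented a simple framework that can be used to bring retrieval-augmented generation to visual question answering. The core idea of the method is to use the vision language model to generate the query, which a standard RAG pipeline then uses to retrieve information from an external corpus. I also presented an implementation of the framework that gives Phi-3.5-vision access to Wikipedia.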

Source: https://towardsdatascience.com/a-simple-framework-for-rag-enhanced-visual-question-answering-06768094762e
