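[Guide] Image Extraction with Langchain and Gemini

Published by alex on August 28, 2024

Do you have a large batch of images that you want to annotate with a language model? This article walks you through using Langchain and Gemini-Flash-1.5 to extract content from images and return structured attributes. We have implemented this for a large online retailer, processing thousands of product images. By describing each image and extracting additional attributes such as color and search-engine tags, we can generate structured data that improves both the user experience and SEO rankings once the products are uploaded to the e-commerce site.

We use LLMs together with Langchain across several subtasks so that LLM output can be parsed efficiently. Langchain also improves throughput by running operations in parallel.

Our images

For this example, I took pictures of some fruit; we will use them to build the extraction pipeline and try to annotate them with an LLM.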

# Images to extract data from
fruits = ['https://storage.googleapis.com/vectrix-public/fruit/apple.jpeg',
          'https://storage.googleapis.com/vectrix-public/fruit/banana.jpeg',
          'https://storage.googleapis.com/vectrix-public/fruit/kiwi.jpeg',
          'https://storage.googleapis.com/vectrix-public/fruit/peach.jpeg',
          'https://storage.googleapis.com/vectrix-public/fruit/plum.jpeg']

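Passing an image directly to the model

Before building a chain, we can pass an image directly to the LLM through Langchain's chat model class. Let's test this with the Gemini Flash model and see how it responds. Make sure your API key is set as an environment variable named GOOGLE_API_KEY.

Example code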

from langchain_core.messages import HumanMessage
from langchain_google_genai import ChatGoogleGenerativeAI
import base64, httpx
# Initialize the model
model = ChatGoogleGenerativeAI(model="gemini-1.5-flash")
# Download and encode the image
image_data = base64.b64encode(httpx.get(fruits[0]).content).decode("utf-8")
# Create a message with the image
message = HumanMessage(
    content=[
        {"type": "text", "text": "describe the fruit in this image"},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
        },
    ],
)
# Invoke the model with the message
response = model.invoke([message])
# Print the model's response
print(response.content)

Model response

The model returns a detailed description of the fruit in the image. For example:

The fruit is an apple. It is red and yellow, with a small stem on top. The apple has a dimple in the center where the stem was attached. The apple is slightly bruised.

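How it works

Initialize the model: we initialize the Gemini Flash model with the ChatGoogleGenerativeAI class from the langchain_google_genai package.
Download and encode the image: the image is downloaded and base64-encoded.
Create the message: we create a HumanMessage containing the text prompt and the encoded image.
Invoke the model: we invoke the model with the message; it processes the image and generates a description.
Print the response: we print the model's response, a detailed description of the fruit.

With the ChatGoogleGenerativeAI class we can interact with the Gemini Flash model directly, passing in an image and receiving a descriptive response. This is a quick way to test the model's capabilities and see how it handles image input.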
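The download-and-encode step is repeated for every image, so it can help to isolate it in a small helper. The following is a minimal sketch using only the standard library; the name to_data_url and the stand-in bytes are assumptions for illustration, not part of the original article. In the article the bytes come from httpx.get(url).content.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# Stand-in bytes (hypothetical); a real call would pass the downloaded image content
sample_bytes = b"\xff\xd8\xff\xe0"
data_url = to_data_url(sample_bytes)
print(data_url)
```

The returned string has the same shape as the value placed under "image_url" in the message above.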

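Now that we have our images and know how to call the model, let's build the extraction pipeline.

Step 1: Define the output structure

The next step is to extract structured data from an image. We can do this by combining a Pydantic parser with a multimodal message. First we define a Pydantic data model, then we pass its format instructions to the model so that it extracts structured data from the image.

Define the data model

We define the data model with Pydantic, which helps ensure the extracted data is structured and validated. In this example we define a Fruit model whose fields are the name, color, taste, and marketing description of the fruit in the image.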

from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
# Define a Pydantic model to parse the model's output
class Fruit(BaseModel):
    name: str = Field(description="The name of the fruit shown in the image")
    color: str = Field(description="The color of the fruit shown in the image")
    taste: str = Field(description="The taste of the fruit shown in the image")
    marketing_description: str = Field(description="A marketing description of the fruit shown in the image")
parser = PydanticOutputParser(pydantic_object=Fruit)

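Set up the prompt

We use ChatPromptTemplate to create a prompt that asks the model to respond in the required JSON structure. The prompt consists of a system message and a human message; the human message carries the image as a base64-encoded data URL.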

prompt = ChatPromptTemplate.from_messages([
    ("system", "Return the requested response object in {language}.\n'{format_instructions}'\n"),
    ("human", [
        {
            "type": "image_url",
            "image_url": {"url": "data:image/jpeg;base64,{image_data}"},
        },
    ]),
])

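Combine the prompt, model, and parser

We then combine the prompt, the model, and the parser into a chain. The model processes the image and returns the data as JSON, which PydanticOutputParser parses and validates against the Fruit data model.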

chain = prompt | model | parser
# Retrieve the encoded image data
image_data = base64.b64encode(httpx.get(fruits[3]).content).decode("utf-8")
# Run the chain and print the result
print(chain.invoke({
    "language": "English",
    "format_instructions": parser.get_format_instructions(),
    "image_data": image_data
}).json(indent=2))

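How it works

Create the prompt: ChatPromptTemplate builds a prompt that instructs the model to respond in a specific JSON format.
Model processing: the multimodal LLM (Gemini-Flash-1.5) processes the image and generates a JSON response containing the structured data.
Parse and validate: PydanticOutputParser parses the JSON response and validates it against the Fruit data model, ensuring the data is well-formed and matches the defined schema.

As the example above shows, we created a new chain that combines the prompt, the image, and the format instructions, asking the model to respond in the required JSON structure. PydanticOutputParser then extracts the JSON from the LLM response and loads it into a Fruit object. The final response object looks like this: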

{
  "name": "Peach",
  "color": "Orange",
  "taste": "Sweet",
  "marketing_description": "A juicy and flavorful peach, perfect for a summer snack or dessert."
}
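To make the parse-and-validate step more concrete, here is a simplified stand-in written with the standard library only. It is not how PydanticOutputParser is implemented; FruitRecord, parse_fruit, and the raw reply text are hypothetical names and data chosen for illustration. The sketch pulls a JSON object out of a reply and checks that the four expected string fields are present.

```python
import json
import re
from dataclasses import dataclass, fields


@dataclass
class FruitRecord:
    # Same four string fields as the Fruit model in the article
    name: str
    color: str
    taste: str
    marketing_description: str


def parse_fruit(raw_reply: str) -> FruitRecord:
    # Grab everything from the first "{" to the last "}" so that any
    # prose around the JSON object is ignored
    match = re.search(r"\{.*\}", raw_reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the model reply")
    data = json.loads(match.group(0))

    # Validate: every expected field must be present and must be a string
    expected = [f.name for f in fields(FruitRecord)]
    for key in expected:
        if key not in data:
            raise ValueError(f"missing field: {key}")
        if not isinstance(data[key], str):
            raise ValueError(f"field {key} must be a string")
    return FruitRecord(**{key: data[key] for key in expected})


# Hypothetical raw reply with some chatter around the JSON object
reply = (
    'Here is the result:\n'
    '{"name": "Peach", "color": "Orange", "taste": "Sweet", '
    '"marketing_description": "A juicy and flavorful peach."}\n'
    'Let me know if you need more.'
)
fruit = parse_fruit(reply)
print(fruit.name, "/", fruit.color)
```

A reply that lacks a field or contains no JSON raises ValueError, which is roughly the failure mode a real output parser surfaces when the model strays from the requested format.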

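Step 2: Process images in parallel

The pipeline we defined in step 1 works well, but it can be slow when there are thousands of images to process. Since parsing a single image can take several seconds, running a large dataset sequentially could take hours or even days. Fortunately, Langchain offers a solution for parallel processing: the chain.batch method.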
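A quick back-of-the-envelope calculation shows why sequential processing becomes a problem. The workload below (5,000 images at 3 seconds each) is hypothetical, and the estimate is idealized: it assumes work is split evenly across workers and ignores rate limits and retries.

```python
def estimate_hours(n_images: int, seconds_per_image: float, max_concurrency: int = 1) -> float:
    """Idealized wall-clock estimate: total work divided evenly across workers."""
    total_seconds = n_images * seconds_per_image / max_concurrency
    return total_seconds / 3600


# Hypothetical workload: 5,000 images at 3 seconds each
sequential = estimate_hours(5000, 3.0)
parallel = estimate_hours(5000, 3.0, max_concurrency=5)
print(f"sequential: {sequential:.2f} h, with 5 workers: {parallel:.2f} h")
# prints "sequential: 4.17 h, with 5 workers: 0.83 h"
```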

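Run the chain in parallel

To run the chain over all images in parallel, we first prepare a list of dictionaries holding the data each image needs. We then call the batch method on the chain, which processes multiple images concurrently.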

# Prepare the list of image data dictionaries
all_images = [{"language": "English", 
               "format_instructions": parser.get_format_instructions(),
               "image_data": base64.b64encode(httpx.get(url).content).decode("utf-8")} 
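              for url in fruits]
# Run the chain in parallel with a specified max concurrency
results = chain.batch(all_images, config={"max_concurrency": 5})
# Print the results
for result in results:
    print(result.json(indent=2))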

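How it works

Prepare the data: we build a list of dictionaries, one per image URL, each containing the language, the format instructions, and the base64-encoded image data.
Parallel processing: calling the batch method on the chain processes all extraction requests in parallel. The max_concurrency config option caps the number of concurrent requests, which helps avoid hitting the model API's rate limits.
Retrieve the results: the results object is a list of Fruit objects, one holding the extracted data for each image.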
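The idea behind a capped batch can be sketched with a plain thread pool. This is an illustration of bounded concurrency, not Langchain's implementation; run_batch and fake_extract are hypothetical names, and fake_extract merely sleeps to simulate a slow model call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_extract(image_name: str) -> dict:
    # Stand-in for a model call: wait briefly, then return a tiny result
    time.sleep(0.05)
    return {"name": image_name.split(".")[0].capitalize()}


def run_batch(inputs, worker, max_concurrency=5):
    # At most max_concurrency calls run at once; pool.map returns results
    # in the same order as the inputs
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(worker, inputs))


images = ["apple.jpeg", "banana.jpeg", "kiwi.jpeg", "peach.jpeg", "plum.jpeg"]
results = run_batch(images, fake_extract, max_concurrency=2)
print(results)
```

Lowering max_concurrency trades speed for a gentler request rate, which is the same trade-off the max_concurrency option exposes.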

Example output

The results contain structured data for each image, similar to the following:

{
  "name": "Apple",
  "color": "Red and Green",
  "taste": "Sweet and Tart",
  "marketing_description": "A crisp and juicy apple with a sweet and tart flavor. Perfect for snacking or baking."
}
{
  "name": "Banana",
  "color": "Yellow",
  "taste": "Sweet",
  "marketing_description": "A delicious and nutritious fruit, perfect for a quick snack or a healthy breakfast. Our bananas are ripe and ready to eat, with a sweet and creamy flavor that everyone will love."
}
...

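Step 3: Make sure the output contains enough variation

As the example above shows, the descriptions can be very similar. Given the same prompt for similar tasks, a language model tends to generate similar output. Raising the model's temperature may mitigate this, but setting it too high risks breaking the JSON structure.

Similar output is bad for search engine optimization, so we need to make sure the model generates unique descriptions. We can introduce some variation by forcing the model to start its output with a random letter and by varying the length. These are the functions we use for this: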

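import random

def generate_random_letter():
    # Pick a starting letter from a fixed set of candidates
    letters = ['A', 'B', 'C', 'D', 'M', 'P', 'R', 'S', 'T']
    return random.choice(letters)

def generate_random_number():
    # Pick a random length value between 30 and 45 (inclusive)
    return random.randint(30, 45)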
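Update the prompt

First, we update the prompt to include a starting-letter variable. (The prompt below uses only the starting letter; the random length is not wired in.)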

# A new prompt template that includes the marketing description starting with a given letter
prompt = ChatPromptTemplate.from_messages([
    (
        "system", "Return the requested response object in {language}. Make sure the marketing description starts with the letter '{starting_letter}'\n'{format_instructions}'\n"
    ),
    (
        "human", [
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64,{image_data}"},
            },
        ],
    )
])

Add randomness to the batch

Next, we update the dictionaries used to invoke the batch so that each prompt includes some randomness.

# Prepare the list of image data dictionaries with added randomness
all_images = [{"language": "English", 
               "format_instructions": parser.get_format_instructions(),
               "image_data": base64.b64encode(httpx.get(url).content).decode("utf-8"),
               "starting_letter": generate_random_letter()} 
              for url in fruits]
chain = prompt | model | parser
# Run the chain in parallel with a specified max concurrency
results = chain.batch(all_images, config={"max_concurrency": 5})
# Print the results
for result in results:
    print(result.json(indent=2))

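How it works

Generate random values: we use generate_random_letter to create a random starting letter for each description, which varies the output.
Update the prompt: the prompt includes the starting-letter variable, forcing the model to begin the marketing description with that letter.
Parallel processing with randomness: adding a starting letter to every dictionary in the all_images list introduces variation across the batch.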
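The random length could be fed into the prompt in the same way as the starting letter. The sketch below is one possible wiring, shown with plain string formatting so it runs without any model; the word_count variable, the "words" unit, and the helper name build_variation_inputs are assumptions, not something the article's prompt contains.

```python
import random

# Hypothetical system message that uses both a starting letter and a length
SYSTEM_TEMPLATE = (
    "Return the requested response object in {language}. "
    "Make sure the marketing description starts with the letter "
    "'{starting_letter}' and is about {word_count} words long."
)

LETTERS = ['A', 'B', 'C', 'D', 'M', 'P', 'R', 'S', 'T']


def build_variation_inputs(n_images, seed=None):
    # A seeded generator makes the random choices reproducible when needed
    rng = random.Random(seed)
    return [
        {
            "language": "English",
            "starting_letter": rng.choice(LETTERS),
            "word_count": rng.randint(30, 45),
        }
        for _ in range(n_images)
    ]


inputs = build_variation_inputs(3, seed=42)
for item in inputs:
    print(SYSTEM_TEMPLATE.format(**item))
```

In the chain itself, the extra key would sit next to starting_letter in each input dictionary and the template would reference it as {word_count}.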

Example output

The printed results show more varied output:

{
  "name": "Apple",
  "color": "Red and Green",
  "taste": "Sweet and Tart",
  "marketing_description": "Crisp, juicy, and bursting with flavor, this apple is the perfect snack for any occasion. Enjoy it on its own, or use it in your favorite recipes."
}
{
  "name": "Banana",
  "color": "Yellow",
  "taste": "Sweet",
  "marketing_description": "Bananas are a delicious and versatile fruit that can be enjoyed in many different ways. They are a good source of potassium and fiber, and they are also a good source of vitamins B6 and C. Bananas are a great snack, and they can also be used in smoothies, baked goods, and other recipes. They are also a great source of energy, and they can help to improve your mood."
}
{
  "name": "Kiwi",
  "color": "Green",
  "taste": "Sweet and tangy",
  "marketing_description": "Come and try our delicious kiwi! This green fruit is sweet and tangy, perfect for a healthy snack or a refreshing addition to your smoothies."
}
...

By adding randomness to the prompt, we push the model toward unique and varied descriptions, which benefits search engine optimization.
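A model may not always follow the starting-letter instruction, so a small check over the results can be useful. The following is a sketch under stated assumptions: check_descriptions is a hypothetical helper, the records are the first sentences of the sample outputs above, and the requested letters are invented for the example.

```python
def check_descriptions(records, requested_letters):
    """Return (name, problem) pairs for descriptions that break the rules."""
    problems = []
    seen = set()
    for record, letter in zip(records, requested_letters):
        text = record["marketing_description"].strip()
        # Rule 1: the description must start with the requested letter
        if not text.upper().startswith(letter.upper()):
            problems.append((record["name"], f"does not start with '{letter}'"))
        # Rule 2: no two descriptions may be identical
        if text in seen:
            problems.append((record["name"], "duplicate description"))
        seen.add(text)
    return problems


# First sentences of the sample outputs shown above
records = [
    {"name": "Apple", "marketing_description": "Crisp, juicy, and bursting with flavor, this apple is the perfect snack for any occasion."},
    {"name": "Banana", "marketing_description": "Bananas are a delicious and versatile fruit that can be enjoyed in many different ways."},
    {"name": "Kiwi", "marketing_description": "Come and try our delicious kiwi!"},
]
# Hypothetical requested letters; the last one deliberately mismatches
problems = check_descriptions(records, ["C", "B", "S"])
print(problems)
```

Entries that fail the check could be sent through the chain again with a fresh starting letter.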

We hope this guide inspires you to try image metadata extraction in your own projects.

Source: https://medium.com/vectrix-ai/image-extraction-with-langchain-and-gemini-a-step-by-step-guide-02c79abcd679
