[Gemini Vision] Revolutionizing Image Data Extraction: A Comprehensive Guide

Published by alex on February 22, 2024

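In the fast-moving landscape of large language models (LLMs), the recent surge of multimodal integration peaked when Google released Gemini as an API on December 13. OpenAI's GPT-4 led this shift, and Gemini adds a new dimension to working with different data types. This article explores the potential of the Gemini Vision model, which is designed for image prompts. It focuses on applications in image-based scenarios, with particular emphasis on data extraction and on building applications with the model. As LLMs enter the multimodal era, understanding the nuances of Gemini's capabilities offers a glimpse of where language models are headed: systems that combine many forms of data to give users a richer and broader experience.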

A Simple Example

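To use the Gemini Vision model, we load it through the ChatGoogleGenerativeAI class. We then build the prompt with the HumanMessage class. The content is structured as a list of dictionaries, each with a "type" and the corresponding content. The "type" can be either "text" or "image_url". This structure matches what the LangChain integration expects when passing queries.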

!pip install -U --quiet langchain-google-genai langchain

import requests
from IPython.display import Image
image_url = "https://picsum.photos/seed/picsum/300/300"
content = requests.get(image_url).content
Image(content)

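from langchain_core.messages import HumanMessage
from langchain_google_genai import ChatGoogleGenerativeAI

# Requires a Google API key, e.g. in the GOOGLE_API_KEY environment variable
llm = ChatGoogleGenerativeAI(model="gemini-pro-vision")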

message = HumanMessage(
    content=[
        {
            "type": "text",
            "text": "What's in this image?",
        },  # You can optionally provide text parts
        {"type": "image_url", "image_url": image_url},
    ]
)
llm.invoke([message])

AIMessage(content=' The image contains a snow-capped mountain peak.')
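
The content structure described above (a list of dictionaries whose "type" is either "text" or "image_url") can be wrapped in a small helper. The following is only a sketch: `build_content` and `validate_content` are hypothetical names introduced here for illustration, not part of LangChain, and the resulting list would be passed as `HumanMessage(content=...)`.

```python
from typing import Dict, List

# The two part types the article describes.
ALLOWED_TYPES = {"text", "image_url"}


def build_content(prompt: str, image_ref: str) -> List[Dict[str, str]]:
    """Return the list-of-dicts payload: one text part, then one image part."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if not image_ref:
        raise ValueError("image_ref must not be empty")
    return [
        {"type": "text", "text": prompt},
        {"type": "image_url", "image_url": image_ref},
    ]


def validate_content(parts: List[Dict[str, str]]) -> None:
    """Raise ValueError if a part has an unknown type or lacks its payload key."""
    for index, part in enumerate(parts):
        part_type = part.get("type")
        if part_type not in ALLOWED_TYPES:
            raise ValueError(f"part {index}: unsupported type {part_type!r}")
        # In this structure the payload key has the same name as the type.
        if part_type not in part:
            raise ValueError(f"part {index}: missing {part_type!r} key")


content = build_content(
    "What's in this image?",
    "https://picsum.photos/seed/picsum/300/300",
)
validate_content(content)
```

Validating before the call keeps malformed parts from reaching the API, where the resulting error may be harder to read.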

Extracting Data from Invoice or Bill Images

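Businesses deal with invoices in many different formats, and the multimodal LLM Gemini offers a practical way to handle them. Given an invoice image as a prompt, the model can extract key data such as vendor details and invoice amounts. Gemini not only recognizes this information but can also structure it so that it is easy to integrate into financial systems. This streamlined process speeds up invoice handling and saves businesses time and resources.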

Sample image:

# Create the HumanMessage prompt with the invoice image (file_path points to the image file)
hmessage = HumanMessage(
    content=[
        {
            "type": "text",
            "text": "Convert Invoice data into json format with appropriate json tags as required for the data in image ",
        },
        {"type": "image_url", "image_url": file_path},
    ]
)
message = llm.invoke([hmessage])
print(message.content)
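
The snippet above passes `file_path` straight into the "image_url" part. When the image is only available as bytes (for example, an upload held in memory), one option is to encode it as a base64 data URI. This sketch assumes the LangChain integration accepts such URIs in "image_url", which its documentation describes but which is worth verifying against the installed version. `to_data_uri` is a hypothetical helper, and the sample bytes are a stand-in for a real image.

```python
import base64
import mimetypes


def to_data_uri(image_bytes: bytes, filename: str) -> str:
    """Encode raw image bytes as a base64 data URI, guessing the MIME type from the name."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Hypothetical stand-in for a real invoice image read from disk:
# only the 8-byte PNG signature, for illustration.
fake_png = b"\x89PNG\r\n\x1a\n"
data_uri = to_data_uri(fake_png, "invoice.png")
print(data_uri)
```

The resulting string would take the place of `file_path` in the "image_url" part.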

Output:

 ```json
{
  "Invoice Summary": {
    "Invoice Number": "498711077",
    "Invoice Date": "July 3, 2020",
    "Total Amount Due on July 3, 2020": "$2,657.68"
  },
  "Summary": {
    "AWS Service Charges": "$2,657.68",
    "Credits": "$0.00",
    "Tax": "$0.00",
    "Total for this Invoice": "$2,657.68"
  },
  "Detail": {
    "Amazon Simple Storage Service": "$0.94",
    "AWS Data Transfer": "$2,656.74",
    "AmazonCloudWatch": "$0.00",
    "AWS Key Management Service": "$0.00",
    "Amazon Simple Queue Service": "$0.00"
  }
}
```
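
Note that the raw reply wraps the JSON in a Markdown code fence, so passing it directly to `json.loads` would fail. As a minimal sketch of handling this by hand, the hypothetical helper `parse_fenced_json` below strips an optional fence before parsing. The sample reply is shortened from the output above.

```python
import json
import re

# Matches an optional Markdown code fence (three backticks, optional "json" tag)
# around the payload. The fence is written as `{3} so it does not appear literally.
FENCE_PATTERN = re.compile(r"^\s*`{3}(?:json)?\s*(.*?)\s*`{3}\s*$", re.DOTALL)


def parse_fenced_json(raw: str):
    """Strip a surrounding Markdown fence, if any, and parse the rest as JSON."""
    match = FENCE_PATTERN.match(raw)
    payload = match.group(1) if match else raw
    return json.loads(payload)


# A shortened reply in the same shape as the model output above.
fence = "`" * 3
raw_reply = f' {fence}json\n{{"Invoice Number": "498711077", "Tax": "$0.00"}}\n{fence}'
invoice = parse_fenced_json(raw_reply)
print(invoice["Invoice Number"])
```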

You can also use JsonOutputParser to get the result as properly parsed JSON:

from langchain_core.output_parsers import JsonOutputParser

parser = JsonOutputParser()

chain =  llm | parser
bill_json = chain.invoke([hmessage])
print(type(bill_json))
print(bill_json)

Output:

<class 'dict'>
{'Invoice Summary': {'Invoice Number': '498711077', 'Invoice Date': 'July 3, 2020', 'Total Amount Due on July 3, 2020': '$2,657.68'}, 'Summary': {'AWS Service Charges': '$2,657.68', 'Credits': '$0.00', 'Tax': '$0.00', 'Total for this Invoice': '$2,657.68'}, 'Detail': [{'Amazon Simple Storage Service': '$0.94', 'AWS Data Transfer': '$2,656.74', 'AmazonCloudWatch': '$0.00', 'AWS Key Management Service': '$0.00', 'Amazon Simple Queue Service': '$0.00'}]}
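
Once the invoice is a Python dict, it can be sanity-checked before it reaches a financial system. The sketch below converts the currency strings to `Decimal` and compares the sum of the line items with the stated total. The helper names are hypothetical, and the code accepts "Detail" as either a dict or a list of dicts, since the two sample outputs in this article differ on that point.

```python
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Convert a string such as '$2,657.68' to an exact Decimal."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


def detail_total(detail) -> Decimal:
    """Sum line items whether 'Detail' is a dict or a list of dicts."""
    groups = detail if isinstance(detail, list) else [detail]
    return sum(
        (parse_amount(value) for group in groups for value in group.values()),
        Decimal("0"),
    )


# Values copied from the sample output above.
bill = {
    "Summary": {"Total for this Invoice": "$2,657.68"},
    "Detail": [{
        "Amazon Simple Storage Service": "$0.94",
        "AWS Data Transfer": "$2,656.74",
        "AmazonCloudWatch": "$0.00",
        "AWS Key Management Service": "$0.00",
        "Amazon Simple Queue Service": "$0.00",
    }],
}

stated = parse_amount(bill["Summary"]["Total for this Invoice"])
computed = detail_total(bill["Detail"])
print(stated == computed)
```

A mismatch between the two values would be a cheap signal that the model misread a figure and the invoice needs manual review.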

Extracting Data from Product Label Images

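In retail and manufacturing, handling many different product labels can be tedious. Gemini, as a multimodal LLM, can simplify extracting data from product label images. Given an image prompt, it can identify and extract key information such as the product name and nutritional content, automating the process, saving time, and reducing the errors that come with manual transcription. This helps businesses manage product information more effectively and improves workflow efficiency.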

Sample image:

product_msg = HumanMessage(
    content=[
        {
            "type": "text",
            "text": "Create a json with following tags extracted from image and use information only from image for value of each tag - 'product_name','manufactured_date','expiry_date','manufactured_by','marketed_by','ingredients'",
        },  # You can optionally provide text parts
        {"type": "image_url", "image_url": image},
    ]
)
prod_output = llm.invoke([product_msg])
print(prod_output.content)

We specifically asked for the product name, manufactured date, expiry date, manufacturer, marketer, and ingredients. Output:

```json
{
  "product_name": "NIVEA Micellar Water",
  "manufactured_date": "2022-07-18",
  "expiry_date": "2024-07-18",
  "manufactured_by": "Nivea",
  "marketed_by": "Nivea",
  "ingredients": "Aqua, Glycerin, Poloxamer 124, Rosa Canina Fruit Extract, Sodium Hyaluronate, Allantoin, Propylene Glycol, PEG-40 Hydrogenated Castor Oil, Sodium Chloride, Citric Acid, Tetrasodium EDTA, Methylparaben, Phenoxyethanol, Parfum"
}
```
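
Because the prompt names a fixed set of tags, the parsed result can be checked mechanically. The following sketch assumes the reply has already been parsed into a dict and that dates come back in YYYY-MM-DD form, as in the sample above. `check_product` is a hypothetical helper, and the ingredients value is shortened for brevity.

```python
from datetime import date

# The tags requested in the prompt above.
REQUIRED_TAGS = (
    "product_name", "manufactured_date", "expiry_date",
    "manufactured_by", "marketed_by", "ingredients",
)


def check_product(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = [f"missing tag: {tag}" for tag in REQUIRED_TAGS if not record.get(tag)]
    try:
        made = date.fromisoformat(record.get("manufactured_date") or "")
        expires = date.fromisoformat(record.get("expiry_date") or "")
    except (TypeError, ValueError):
        problems.append("dates are not in YYYY-MM-DD format")
    else:
        if expires <= made:
            problems.append("expiry_date is not after manufactured_date")
    return problems


# Values taken from the sample output above (ingredients shortened).
sample = {
    "product_name": "NIVEA Micellar Water",
    "manufactured_date": "2022-07-18",
    "expiry_date": "2024-07-18",
    "manufactured_by": "Nivea",
    "marketed_by": "Nivea",
    "ingredients": "Aqua, Glycerin",
}
problems = check_product(sample)
print(problems)
```

Checks like these do not prove the values match the label, but they catch missing fields and implausible dates early.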

Trying Follow-Up Questions

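The Gemini Vision API is designed to accept only two kinds of input: "text" and "image_url". This deliberate restriction means that a conversational chain cannot be implemented directly. By focusing on text and image URL inputs, the API favors simplicity and efficiency.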

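Although direct conversational chains may not be supported, you can still enrich the interaction by adding more "text" parts. This design encourages users to supplement the image prompt with relevant text, which adds context and guides the model's understanding. By including additional text parts, users can tailor the input to their needs and get more nuanced results from the Gemini Vision model.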

Sample image:

message = HumanMessage(
    content=[
        {
            "type": "text",
            "text": "Who is this Pokemon?",
        },  # You can optionally provide text parts
        {"type": "image_url", "image_url": file_path},
    ]
)
message_output = llm.invoke([message])
print(message_output.content)

This is Pikachu, a well-known Pokemon character.

Now let's inspect the HumanMessage:

# Let's check what is inside the user message
message
# output
HumanMessage(content=[{'type': 'text', 'text': 'Who is this Pokemon?'}, {'type': 'image_url', 'image_url': '/content/download (10).png'}])

The content is a list, so we can add new text parts to it.

# We can add text parts to the content list.
# insert(-1, ...) places each new part just before the last element (the image).
message.content.insert(-1, {
    "type": "text",
    "text": message_output.content,
})
# Add the new user question
new_query = "what types of attack he knows?"
message.content.insert(-1, {
    "type": "text",
    "text": new_query,
})

message_output = llm.invoke([message])
print(message_output.content)

Electric attacks are Pikachu's specialty, but it can also learn other types of attacks, such as Normal, Flying, and Steel-type attacks. Some of the attacks Pikachu can learn include Thunderbolt, Quick Attack, Iron Tail, and Agility.
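
The follow-up technique above mutates the original message in place. As an alternative sketch, the hypothetical helper `with_follow_up` below builds a new content list with the previous answer and the next question placed before the image, so the original list stays unchanged. It works on plain lists of dicts, and the result would be passed as `HumanMessage(content=...)`.

```python
from typing import Dict, List

Part = Dict[str, str]


def with_follow_up(parts: List[Part], previous_answer: str, new_query: str) -> List[Part]:
    """Return a new content list with the prior answer and the new question
    placed after the existing text parts and before the image parts."""
    texts = [part for part in parts if part["type"] == "text"]
    images = [part for part in parts if part["type"] == "image_url"]
    texts.append({"type": "text", "text": previous_answer})
    texts.append({"type": "text", "text": new_query})
    return texts + images


# Content and answer taken from the example above.
first_turn = [
    {"type": "text", "text": "Who is this Pokemon?"},
    {"type": "image_url", "image_url": "/content/download (10).png"},
]
second_turn = with_follow_up(
    first_turn,
    "This is Pikachu, a well-known Pokemon character.",
    "what types of attack he knows?",
)
print([part["type"] for part in second_turn])
```

Returning a new list makes it possible to replay or branch the conversation from any earlier turn, at the cost of copying the parts each time.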

Conclusion

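In conclusion, the arrival of multimodal large language models (LLMs), represented by the Google Gemini family, marks a significant shift in AI capabilities. These models, including Gemini Nano, Pro, and Ultra, integrate text, image, audio, and video inputs. They are versatile across tasks ranging from natural language understanding to complex video and audio processing.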

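Gemini's real-world applications, illustrated through scenarios such as financial analysis, invoice parsing, and product label interpretation, show its potential for automating a wide range of tasks. Despite these results, its current limitations must be acknowledged, such as occasional hallucinations when interpreting images, which underline that multimodal LLMs are still evolving.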

Source: https://medium.com/@mohammed97ashraf/revolutionizing-image-data-extraction-a-comprehensive-guide-to-gemini-pro-vision-and-langchain-200bbc60b949
