Build a Document AI Pipeline for Any Type of PDF with Gemini

Published December 27, 2024 by alex

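Automated document processing is one of the biggest winners of the ChatGPT revolution, because large language models (LLMs) can handle a wide range of topics and tasks in a zero-shot setting, that is, without in-domain labeled training data. This has made it much easier to build AI-powered applications that process, parse, and automatically understand arbitrary documents. Naive approaches to using LLMs are still limited by non-text content such as charts, images, and tables. This article tries to address that limitation, with a particular focus on PDF files.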

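At a basic level, a PDF is just a collection of characters, images, and lines together with their exact coordinates. It has no inherent "text" structure and was not built to be processed as text, only to be viewed as is. That is what makes PDFs hard to work with: text-only approaches cannot capture all the layout and visual elements in these documents, which leads to a significant loss of context and information.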
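To make the loss of layout concrete, here is a small sketch. It assumes a hypothetical two-column page represented as text spans with made-up coordinates; the `Span` class, the `naive_text` function, and all values are illustrative and not part of the article's code. Reading the spans in plain top-to-bottom, left-to-right order interleaves the two columns:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    x: float   # horizontal position of the span (made-up units)
    y: float   # vertical position, increasing down the page
    text: str


def naive_text(spans):
    """Concatenate spans sorted top-to-bottom, then left-to-right."""
    ordered = sorted(spans, key=lambda s: (s.y, s.x))
    return " ".join(s.text for s in ordered)


# Hypothetical page with two columns of two lines each.
page = [
    Span(50, 100, "Left column, line 1."),
    Span(50, 120, "Left column, line 2."),
    Span(320, 100, "Right column, line 1."),
    Span(320, 120, "Right column, line 2."),
]
print(naive_text(page))
```

The extracted string alternates between the left and right columns, so the reading order a human sees on the page is lost even though every character was recovered.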

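One way around this "text-only" limitation is to preprocess the document heavily before feeding it to the LLM, by detecting tables, images, and layout. Tables can be parsed into Markdown or JSON, images and charts can be represented by their captions, and text can be fed in as is. However, this approach requires custom models and still loses some information. Can we do better?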
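As an illustration of the preprocessing route, the sketch below serializes an already-detected table to Markdown. It assumes the table has been extracted into a header and rows by some upstream detector, which is the hard part and is not shown; `table_to_markdown` and the sample values are hypothetical.

```python
def table_to_markdown(header, rows):
    """Render a header and a list of rows as a Markdown table."""

    def fmt(cells):
        return "| " + " | ".join(str(cell) for cell in cells) + " |"

    lines = [fmt(header), fmt(["---"] * len(header))]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


# Made-up table contents, for illustration only.
header = ["Model", "Pages", "Seconds"]
rows = [["example-a", 10, 7.5], ["example-b", 20, 15.0]]
print(table_to_markdown(header, rows))
```

Cells that contain a `|` character would need escaping, and merged cells or nested headers have no direct Markdown equivalent, which is one place where this route loses information.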

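Multimodal LLMs

Most recent large models are multimodal, meaning they can handle several modalities such as text, code, and images. This opens up a simpler solution to our problem, in which one model does all the work at once. Instead of captioning images and parsing tables, we can feed the page in directly as an image and process it as is. Our pipeline will load the PDF, extract each page as an image, split it into chunks (using the LLM), and index each chunk. When a chunk is retrieved, the full page is included in the LLM context to perform the task. The rest of this article details how this works in practice.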

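The Pipeline

The pipeline we are implementing is a two-step process. First, we segment each page into its significant chunks and summarize each of them. Second, we index the chunks once, then search them each time a request comes in, and include the full context of each retrieved chunk in the LLM context.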

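Step 1: Page Segmentation and Summarization

We extract the pages as images and pass each one to the multimodal LLM for segmentation. Models like Gemini can readily understand and process page layout:

Tables are identified as one chunk.
Figures form another chunk.
Blocks of text are split into individual chunks.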
...

For each element, the LLM generates a summary that can be embedded and indexed into a vector database.
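The sketch below shows what the result for one page might look like and how it can be turned into records ready for indexing. The JSON string is a made-up stand-in for a model response; the field names `element_type` and `summary` follow the schema used later in this article, while `Chunk` and `parse_page_response` are hypothetical names.

```python
import json
from dataclasses import dataclass

ALLOWED_TYPES = {"Table", "Figure", "Image", "Text-block"}


@dataclass
class Chunk:
    page_number: int
    element_type: str
    summary: str


def parse_page_response(raw_json, page_number):
    """Turn one page's JSON response into a list of Chunk records."""
    data = json.loads(raw_json)
    chunks = []
    for item in data.get("layout_items", []):
        if item["element_type"] not in ALLOWED_TYPES:
            raise ValueError(f"Unexpected element type: {item['element_type']!r}")
        chunks.append(Chunk(page_number, item["element_type"], item["summary"]))
    return chunks


# Made-up response, for illustration only.
raw = json.dumps(
    {
        "layout_items": [
            {"element_type": "Table", "summary": "Results table comparing two models."},
            {"element_type": "Text-block", "summary": "Paragraph introducing the method."},
        ]
    }
)
for chunk in parse_page_response(raw, page_number=3):
    print(chunk)
```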

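Step 2: Embedding and Contextual Retrieval

To keep things simple, this tutorial uses only text embeddings. One possible improvement would be to use vision embeddings directly.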

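Each entry in the database includes:

A summary of the chunk.
The page number where it was found.
A link to the image representation of the full page, for added context.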

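This schema allows searching at a local level (the chunk level) while keeping track of context (by linking back to the full page). For example, if a search query retrieves an item, the agent can include the entire page image to give the LLM the full layout and extra context when it generates a response, in order to maximize response quality.

By providing the full image, all the visual cues and important layout information (such as images, titles, and bullet points) and neighboring items (tables, paragraphs, and so on) are available to the LLM when it generates a response.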
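A minimal sketch of this chunk-level search with page-level context follows. It is not the article's implementation: a toy bag-of-words vector stands in for a real text embedding, the index is a plain list instead of a vector database, and the entries, file names, and helper names (`IndexEntry`, `embed`, `cosine`, `search`) are all invented for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class IndexEntry:
    summary: str      # chunk summary produced by the LLM
    page_number: int  # page where the chunk was found
    page_image: str   # link to the full-page image


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(index, query, k=2):
    """Return the k entries whose summaries are most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda e: cosine(embed(e.summary), q), reverse=True)
    return ranked[:k]


index = [
    IndexEntry("table of accuracy scores per model", 0, "pages/0.jpg"),
    IndexEntry("figure showing the training loss curve", 1, "pages/1.jpg"),
    IndexEntry("paragraph describing the training data", 1, "pages/1.jpg"),
]
hits = search(index, "training loss curve")
for hit in hits:
    # The match happens on the chunk summary; the page image supplies context.
    print(hit.summary, "->", hit.page_image)
```

The search matches on small units (summaries), but what gets handed to the LLM is the link in `page_image`, so the surrounding tables and paragraphs on that page come along with the matching chunk.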

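Agents

We implement each step as a separate, reusable agent:

The first agent handles parsing, chunking, and summarization. It segments the document into significant chunks and then generates a summary for each one. This agent only needs to run once per PDF to preprocess the document.

The second agent manages indexing, search, and retrieval. This includes inserting the chunk embeddings into the vector database for efficient search. Indexing is done once per document, while searches can be repeated as often as needed for different queries.

For both agents, we use Gemini, a multimodal LLM with strong vision understanding capabilities.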

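Parsing and Chunking Agent

The first agent is in charge of segmenting each page into meaningful chunks and summarizing each of them, following these steps:

Step 1: Extract PDF Pages as Images

We use the pdf2image library. The images are then encoded in Base64 format to make it simpler to add them to the LLM request.

Here is the implementation: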

from document_ai_agents.document_utils import extract_images_from_pdf
from document_ai_agents.image_utils import pil_image_to_base64_jpeg
from pathlib import Path
class DocumentParsingAgent:
    @classmethod
    def get_images(cls, state):
        """
        Extract pages of a PDF as Base64-encoded JPEG images.
        """
        assert Path(state.document_path).is_file(), "File does not exist"
        # Extract images from PDF
        images = extract_images_from_pdf(state.document_path)
        assert images, "No images extracted"
        # Convert images to Base64-encoded JPEG
        pages_as_base64_jpeg_images = [pil_image_to_base64_jpeg(x) for x in images]
        return {"pages_as_base64_jpeg_images": pages_as_base64_jpeg_images}

extract_images_from_pdf: extracts each page of the PDF as a PIL image.

pil_image_to_base64_jpeg: converts an image to a Base64-encoded JPEG.
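Both helpers are imported from the author's `document_ai_agents` package and are not shown in the article. The sketch below is one way they could be written, assuming `pdf2image` (which needs the Poppler utilities installed) and Pillow. The function bodies, the `dpi` default, and the RGB conversion are assumptions rather than the author's code.

```python
import base64
import io


def extract_images_from_pdf(pdf_path, dpi=150):
    """Render each PDF page as a PIL image (requires pdf2image and Poppler)."""
    # Imported lazily so the Base64 helper below works without pdf2image.
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path, dpi=dpi)


def pil_image_to_base64_jpeg(image):
    """Encode a PIL image as a Base64 JPEG string."""
    buffer = io.BytesIO()
    # JPEG has no alpha channel, so convert to RGB before saving.
    image.convert("RGB").save(buffer, format="JPEG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")
```

A higher `dpi` gives the model sharper text at the cost of larger requests, so the value is worth tuning on your own documents.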

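Step 2: Chunking and Summarization

Each image is then sent to the LLM for segmentation and summarization. We use structured outputs to make sure we get predictions in the expected format: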

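from pydantic import BaseModel, Field
from typing import Literal
import json
import google.generativeai as genai
from langchain_core.documents import Document
# Helper from the author's package; the module path is assumed, adjust it if yours differs.
from document_ai_agents.schema_utils import prepare_schema_for_gemini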
class DetectedLayoutItem(BaseModel):
    """
    Schema for each detected layout element on a page.
    """
    element_type: Literal["Table", "Figure", "Image", "Text-block"] = Field(
        ..., 
        description="Type of detected item. Examples: Table, Figure, Image, Text-block."
    )
    summary: str = Field(..., description="A detailed description of the layout item.")
class LayoutElements(BaseModel):
    """
    Schema for the list of layout elements on a page.
    """
    layout_items: list[DetectedLayoutItem] = []
class FindLayoutItemsInput(BaseModel):
    """
    Input schema for processing a single page.
    """
    document_path: str
    base64_jpeg: str
    page_number: int
class DocumentParsingAgent:
    def __init__(self, model_name="gemini-1.5-flash-002"):
        """
        Initialize the LLM with the appropriate schema.
        """
        layout_elements_schema = prepare_schema_for_gemini(LayoutElements)
        self.model_name = model_name
        self.model = genai.GenerativeModel(
            self.model_name,
            generation_config={
                "response_mime_type": "application/json",
                "response_schema": layout_elements_schema,
            },
        )
    def find_layout_items(self, state: FindLayoutItemsInput):
        """
        Send a page image to the LLM for segmentation and summarization.
        """
        messages = [
            f"Find and summarize all the relevant layout elements in this PDF page in the following format: "
            f"{LayoutElements.schema_json()}. "
            f"Tables should have at least two columns and at least two rows. "
            f"The coordinates should overlap with each layout item.",
            {"mime_type": "image/jpeg", "data": state.base64_jpeg},
        ]
        # Send the prompt to the LLM
        result = self.model.generate_content(messages)
        data = json.loads(result.text)
        
        # Convert the JSON output into documents
        documents = [
            Document(
                page_content=item["summary"],
                metadata={
                    "page_number": state.page_number,
                    "element_type": item["element_type"],
                    "document_path": state.document_path,
                },
            )
            for item in data["layout_items"]
        ]
        return {"documents": documents}

The LayoutElements schema defines the structure of the output, with each layout item's type (Table, Figure, ...) and its summary.
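The article does not show `prepare_schema_for_gemini`. A plausible reading of its name is that it adapts the JSON Schema generated by Pydantic to a more restricted schema format, for example by inlining `$ref` pointers and dropping keys the target format does not accept. The sketch below works under that assumption only: `simplify_schema`, the set of dropped keys, and the sample schema are hypothetical and are not the author's helper.

```python
import copy

# Keys assumed to be unsupported by the target schema format (an assumption).
DROPPED_KEYS = {"title", "default", "$defs"}


def simplify_schema(schema):
    """Inline local $ref pointers and drop DROPPED_KEYS from a JSON Schema dict.

    Self-referencing schemas are not handled and would recurse forever.
    """
    defs = schema.get("$defs", {})

    def clean(node, in_properties=False):
        if isinstance(node, list):
            return [clean(item) for item in node]
        if not isinstance(node, dict):
            return node
        if "$ref" in node and not in_properties:
            # "#/$defs/Name" -> look up "Name" and inline a cleaned copy.
            name = node["$ref"].rsplit("/", 1)[-1]
            return clean(copy.deepcopy(defs[name]))
        cleaned = {}
        for key, value in node.items():
            # Inside a "properties" map the keys are field names: keep them all.
            if not in_properties and key in DROPPED_KEYS:
                continue
            child_is_properties = key == "properties" and not in_properties
            cleaned[key] = clean(value, in_properties=child_is_properties)
        return cleaned

    return clean(schema)


# Hand-written schema resembling what Pydantic might emit for LayoutElements.
schema = {
    "$defs": {
        "DetectedLayoutItem": {
            "title": "DetectedLayoutItem",
            "type": "object",
            "properties": {
                "element_type": {
                    "title": "Element Type",
                    "type": "string",
                    "enum": ["Table", "Figure", "Image", "Text-block"],
                },
                "summary": {"title": "Summary", "type": "string"},
            },
            "required": ["element_type", "summary"],
        }
    },
    "title": "LayoutElements",
    "type": "object",
    "properties": {
        "layout_items": {
            "title": "Layout Items",
            "type": "array",
            "default": [],
            "items": {"$ref": "#/$defs/DetectedLayoutItem"},
        }
    },
}
print(simplify_schema(schema))
```

The `in_properties` flag exists so that a field that happens to be called `title` or `default` is kept, while the schema annotations with the same names are removed.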

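Step 3: Parallel Processing of Pages

Pages are processed in parallel for speed. Because the processing is I/O-bound, the following method creates a list of tasks so that all page images are handled at once: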

from langgraph.types import Send
class DocumentParsingAgent:
    @classmethod
    def continue_to_find_layout_items(cls, state):
        """
        Generate tasks to process each page in parallel.
        """
        return [
            Send(
                "find_layout_items",
                FindLayoutItemsInput(
                    base64_jpeg=base64_jpeg,
                    page_number=i,
                    document_path=state.document_path,
                ),
            )
            for i, base64_jpeg in enumerate(state.pages_as_base64_jpeg_images)
        ]

Each page is sent to the find_layout_items function as an independent task.
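In the article the fan-out is handled by LangGraph's `Send`. The same idea can be sketched with the standard library alone: because each page spends most of its time waiting on a network call, a thread pool lets the waits overlap. `fake_llm_call` is a stand-in that only sleeps, and the worker count and sleep duration are arbitrary choices for the demonstration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_llm_call(page_number):
    """Stand-in for a network-bound LLM request."""
    time.sleep(0.2)  # simulate waiting on the API
    return {"page_number": page_number, "summary": f"summary of page {page_number}"}


def process_pages(page_numbers, max_workers=8):
    """Process pages concurrently and return results in input order."""
    # Executor.map yields results in input order, whatever the completion order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_llm_call, page_numbers))


start = time.perf_counter()
results = process_pages(range(8))
elapsed = time.perf_counter() - start
print(f"{len(results)} pages in {elapsed:.2f}s")
```

Run sequentially, the eight simulated calls would take about 1.6 seconds; with eight workers the total is close to the duration of a single call. With a real API, rate limits usually determine how many workers are useful.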

Full Workflow

The agent's workflow is built with a StateGraph, which links the image extraction and layout detection steps into a single pipeline.

from langgraph.graph import StateGraph, START, END
class DocumentParsingAgent:
    def build_agent(self):
        """
        Build the agent workflow using a state graph.
        """
        builder = StateGraph(DocumentLayoutParsingState)
        
        # Add nodes for image extraction and layout item detection
        builder.add_node("get_images", self.get_images)
        builder.add_node("find_layout_items", self.find_layout_items)
        # Define the flow of the graph
        builder.add_edge(START, "get_images")
        builder.add_conditional_edges("get_images", self.continue_to_find_layout_items)
        builder.add_edge("find_layout_items", END)
        
        self.graph = builder.compile()
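`DocumentLayoutParsingState` is not shown in the article. Because several `find_layout_items` runs each return their own `documents` list, the state presumably needs a rule for combining them. In LangGraph this is typically expressed by annotating the field with a reducer such as `operator.add`. The plain-Python sketch below imitates that merge without LangGraph; `ParsingState` and `apply_update` are hypothetical names, and the real state class may differ.

```python
import operator
from dataclasses import dataclass, field
from typing import Annotated, get_type_hints


@dataclass
class ParsingState:
    document_path: str
    # The second argument to Annotated is the reducer used to merge updates.
    documents: Annotated[list[str], operator.add] = field(default_factory=list)


def apply_update(state, update):
    """Merge a node's partial update into the state, using reducers if present."""
    hints = get_type_hints(type(state), include_extras=True)
    for key, value in update.items():
        reducers = getattr(hints[key], "__metadata__", ())
        if reducers:
            # Combine the existing value with the new one instead of overwriting.
            value = reducers[0](getattr(state, key), value)
        setattr(state, key, value)
    return state


state = ParsingState(document_path="path/to/document.pdf")
# Each update mimics what one parallel page-processing run returns.
updates = (
    {"documents": ["page 0: table"]},
    {"documents": ["page 1: figure", "page 1: text"]},
)
for update in updates:
    apply_update(state, update)
print(state.documents)
```

Without a reducer, the second update would replace the first and only the chunks from the last page would survive.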

To run the agent on a sample PDF, we do the following:

if __name__ == "__main__":
    _state = DocumentLayoutParsingState(
        document_path="path/to/document.pdf"
    )
    agent = DocumentParsingAgent()
    
    # Step 1: Extract images from PDF
    result_images = agent.get_images(_state)
    _state.pages_as_base64_jpeg_images = result_images["pages_as_base64_jpeg_images"]
    
    # Step 2: Process the first page (as an example)
    result_layout = agent.find_layout_items(
        FindLayoutItemsInput(
            base64_jpeg=_state.pages_as_base64_jpeg_images[0],
            page_number=0,
            document_path=_state.document_path,
        )
    )
    # Display the results
    for item in result_layout["documents"]:
        print(item.page_content)
        print(item.metadata["element_type"])

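This produces a parsed, segmented, and summarized representation of the PDF, which is the input to the second agent that we build next.

RAG (Retrieval-Augmented Generation) Agent

The second agent handles indexing and retrieval. It saves the documents produced by the first agent into a vector database and uses the result for retrieval. This breaks down into two separate steps: indexing and retrieval.

Step 1: Index the Split Document

We take the generated summaries, vectorize them, and save them in a ChromaDB database: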

class DocumentRAGAgent:
    def index_documents(self, state: DocumentRAGState):
        """
        Index the parsed documents into the vector store.
        """
        assert state.documents, "Documents should have at least one element"
        # Check if the document is already indexed
        if self.vector_store.get(where={"document_path": state.document_path})["ids"]:
            logger.info(
                "Documents for this file are already indexed, exiting this node"
            )
            return  # Skip indexing if already done
        # Add parsed documents to the vector store
        self.vector_store.add_documents(state.documents)
        logger.info(f"Indexed {len(state.documents)} documents for {state.document_path}")

The index_documents method embeds the chunk summaries into the vector store. We keep metadata such as the document path and page number for later use.
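The "skip if already indexed" check is what makes it safe to run the graph repeatedly on the same file. The sketch below reproduces that behavior with a hypothetical in-memory store. `InMemoryStore` and its methods only mimic the shape of the calls used above and are not the ChromaDB or LangChain API.

```python
class InMemoryStore:
    """Tiny stand-in for a vector store that keeps (text, metadata) records."""

    def __init__(self):
        self.records = []

    def get(self, where):
        """Return ids of records whose metadata matches every key in `where`."""
        ids = [
            str(i)
            for i, (_, metadata) in enumerate(self.records)
            if all(metadata.get(key) == value for key, value in where.items())
        ]
        return {"ids": ids}

    def add_documents(self, documents):
        self.records.extend(documents)


def index_documents(store, document_path, documents):
    """Index documents once per file and return how many records were added."""
    if store.get(where={"document_path": document_path})["ids"]:
        return 0  # already indexed, nothing to do
    store.add_documents(documents)
    return len(documents)


store = InMemoryStore()
docs = [
    ("summary of a table", {"document_path": "a.pdf", "page_number": 0}),
    ("summary of a figure", {"document_path": "a.pdf", "page_number": 1}),
]
print(index_documents(store, "a.pdf", docs))  # first call adds the records
print(index_documents(store, "a.pdf", docs))  # second call is a no-op
```

Note that the check is keyed on the document path alone, so a file whose content changes but whose path stays the same would not be re-indexed under this scheme.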

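Step 2: Handle Questions

When a user asks a question, the agent searches the vector store for the most relevant chunks. It retrieves the summaries and the corresponding page images for contextual understanding.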

class DocumentRAGAgent:
    def answer_question(self, state: DocumentRAGState):
        """
        Retrieve relevant chunks and generate a response to the user's question.
        """
        # Retrieve the top-k relevant documents based on the query
        relevant_documents: list[Document] = self.retriever.invoke(state.question)
        # Retrieve corresponding page images (avoid duplicates)
        images = list(
            set(
                [
                    state.pages_as_base64_jpeg_images[doc.metadata["page_number"]]
                    for doc in relevant_documents
                ]
            )
        )
        logger.info(f"Responding to question: {state.question}")
        # Construct the prompt: Combine images, relevant summaries, and the question
        messages = (
            [{"mime_type": "image/jpeg", "data": base64_jpeg} for base64_jpeg in images]
            + [doc.page_content for doc in relevant_documents]
            + [
                f"Answer this question using the context images and text elements only: {state.question}",
            ]
        )
        # Generate the response using the LLM
        response = self.model.generate_content(messages)
        return {"response": response.text, "relevant_documents": relevant_documents}

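The retriever queries the vector store to find the chunks most relevant to the user's question. We then build the context for the LLM (Gemini), combining text chunks and images, to generate a response.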
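One detail worth noting: `set` removes duplicate pages but does not preserve the retrieval order. If the order of the images in the prompt matters, an order-preserving deduplication is a small change. The sketch below builds the same kind of message list from plain dictionaries; `build_messages` and the sample data are hypothetical.

```python
def build_messages(question, retrieved, page_images):
    """Assemble images, chunk summaries, and the question into one prompt list.

    retrieved: list of dicts with "summary" and "page_number" keys.
    page_images: list of Base64 JPEG strings indexed by page number.
    """
    # dict.fromkeys drops duplicates while keeping first-seen order.
    page_numbers = list(dict.fromkeys(doc["page_number"] for doc in retrieved))
    images = [
        {"mime_type": "image/jpeg", "data": page_images[number]}
        for number in page_numbers
    ]
    summaries = [doc["summary"] for doc in retrieved]
    instruction = (
        "Answer this question using the context images and text elements only: "
        f"{question}"
    )
    return images + summaries + [instruction]


# Placeholder data; real entries would hold Base64 strings and LLM summaries.
retrieved = [
    {"summary": "equation on page 2", "page_number": 2},
    {"summary": "figure on page 0", "page_number": 0},
    {"summary": "text on page 2", "page_number": 2},
]
page_images = ["<base64 page 0>", "<base64 page 1>", "<base64 page 2>"]
messages = build_messages("What does the equation mean?", retrieved, page_images)
print(len(messages))
```

Here page 2 is sent once even though two of its chunks were retrieved, and it stays ahead of page 0 because its chunk ranked first.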

Full Agent Workflow

The agent workflow has two stages, an indexing stage and a question-answering stage:

class DocumentRAGAgent:
    def build_agent(self):
        """
        Build the RAG agent workflow.
        """
        builder = StateGraph(DocumentRAGState)
        # Add nodes for indexing and answering questions
        builder.add_node("index_documents", self.index_documents)
        builder.add_node("answer_question", self.answer_question)
        # Define the workflow
        builder.add_edge(START, "index_documents")
        builder.add_edge("index_documents", "answer_question")
        builder.add_edge("answer_question", END)
        self.graph = builder.compile()

Example Run

if __name__ == "__main__":
    from pathlib import Path
    # Import the first agent to parse the document
    from document_ai_agents.document_parsing_agent import (
        DocumentLayoutParsingState,
        DocumentParsingAgent,
    )
    # Step 1: Parse the document using the first agent
    state1 = DocumentLayoutParsingState(
        document_path=str(Path(__file__).parents[1] / "data" / "docs.pdf")
    )
    agent1 = DocumentParsingAgent()
    result1 = agent1.graph.invoke(state1)
    # Step 2: Set up the second agent for retrieval and answering
    state2 = DocumentRAGState(
        question="Who was acknowledged in this paper?",
        document_path=str(Path(__file__).parents[1] / "data" / "docs.pdf"),
        pages_as_base64_jpeg_images=result1["pages_as_base64_jpeg_images"],
        documents=result1["documents"],
    )
    agent2 = DocumentRAGAgent()
    # Index the documents
    agent2.graph.invoke(state2)
    # Answer the first question
    result2 = agent2.graph.invoke(state2)
    print(result2["response"])
    # Answer a second question
    state3 = DocumentRAGState(
        question="What is the macro average when fine-tuning on PubLayNet using M-RCNN?",
        document_path=str(Path(__file__).parents[1] / "data" / "docs.pdf"),
        pages_as_base64_jpeg_images=result1["pages_as_base64_jpeg_images"],
        documents=result1["documents"],
    )
    result3 = agent2.graph.invoke(state3)
    print(result3["response"])

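With this implementation, the pipeline for document processing, retrieval, and question answering is complete.

Example: Using the Document AI Pipeline

Let's walk through a practical example using the document LLM & Adaptation.pdf, a set of 39 slides containing text, equations, and figures (CC BY 4.0).

Step 1: Parse and Summarize the Document (Agent 1)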

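Execution time: parsing the 39-page document took 29 seconds.
Result: Agent 1 produces an indexed document consisting of chunk summaries and a Base64-encoded JPEG image of each page.

Step 2: Question the Document (Agent 2)

We ask the following question:

"Explain LoRA, give the relevant equations"

Result:

[Figure: the retrieved pages]

[Figure: the response from the LLM]

By drawing on the visual context, the LLM was able to incorporate equations and figures into a coherent and correct response grounded in the document.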

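Conclusion

In this quick tutorial, we saw how to take your document AI processing pipeline a step further by leveraging the multimodality of recent LLMs and using the full visual context available in each document, which should improve the quality of the outputs you get from your information extraction or RAG pipeline.

We built a stronger document segmentation step that detects important items such as paragraphs, tables, and figures and summarizes them, then used the result of this first step to query the collection of items and pages and give relevant, precise answers using Gemini. As a next step, you can try it on your own use case and documents, try a scalable vector database, and deploy these agents as part of your AI application.

Source: https://towardsdatascience.com/build-a-document-ai-pipeline-for-any-type-of-pdf-with-gemini-9221c8e143db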
