Building a Multi-Document ReAct Agent for Financial Analysis with LlamaIndex and Qdrant

Published June 20, 2024 by alex



Architecture:

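The architecture diagram shows a system that answers user queries with detailed financial analysis of two companies, Honeywell and GE (General Electric). It combines several components, each of which contributes to retrieving accurate, relevant information and generating a response.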


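The process starts with the user, who submits a query. A typical query asks a specific financial question about Honeywell's or GE's performance, financial position, or strategic initiatives. The agent (shown as a robot icon) mediates between the user and the underlying data processing. On receiving a query, it uses the Reasoning and Acting (ReAct) approach to retrieve the necessary information and generate an answer.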


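The query engine tools are central to what the agent can do. The agent has two of them, each bound to one dataset: one for Honeywell's financial filings and one for GE's. Each tool handles a query by searching its own dataset for the most relevant passages, and the retrieved text is then used to compose a detailed reply to the user's question.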


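The ReAct (Reasoning and Acting) module is what lets the system give accurate, thorough answers. It combines reasoning and acting in a loop: the agent reasons about the information retrieved so far, then takes the next action, until it can produce a coherent, context-rich reply. This way the agent does more than retrieve relevant documents; it also processes them into an answer the user can act on.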
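The reason-then-act loop can be sketched in plain Python. This is a minimal illustration, not LlamaIndex's implementation: `scripted_llm` is a hypothetical stand-in for the model, the two tool functions return placeholder strings, and the `Thought` / `Action` / `Observation` text format is assumed for the example.

```python
import re
from typing import Callable, Dict, List

# Hypothetical stand-ins for the two query engine tools; a real tool would
# run a retrieval query against the matching document collection.
TOOLS: Dict[str, Callable[[str], str]] = {
    "honeywell_10k": lambda q: f"[Honeywell filing excerpt for: {q}]",
    "ge_10k": lambda q: f"[GE filing excerpt for: {q}]",
}

# Assumed step format: a tool name on one line, its input on the next.
ACTION_RE = re.compile(r"Action: (\w+)\nAction Input: (.+)")


def scripted_llm(transcript: List[str]) -> str:
    """Toy policy standing in for the LLM: look something up once, then answer."""
    observations = [t for t in transcript if t.startswith("Observation:")]
    if not observations:
        return (
            "Thought: I need Honeywell's filing.\n"
            "Action: honeywell_10k\n"
            "Action Input: segment overview"
        )
    latest = observations[-1].split(": ", 1)[1]
    return f"Thought: I have enough context.\nAnswer: Based on {latest}"


def react_loop(
    question: str,
    llm: Callable[[List[str]], str] = scripted_llm,
    max_steps: int = 5,
) -> str:
    """Alternate between model steps and tool calls until an answer appears."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = llm(transcript)
        transcript.append(step)
        if "Answer:" in step:
            return step.split("Answer:", 1)[1].strip()
        match = ACTION_RE.search(step)
        if match is None:
            break  # the step contained neither an answer nor a parsable action
        tool_name, tool_input = match.groups()
        tool = TOOLS.get(tool_name)
        observation = tool(tool_input) if tool else f"Unknown tool: {tool_name}"
        transcript.append(f"Observation: {observation}")
    return "No answer within the step limit."


if __name__ == "__main__":
    print(react_loop("What are Honeywell's business segments?"))
```

The `max_steps` cap is a design choice in this sketch: it keeps a model that never emits an answer from looping forever.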


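The data flow begins at ingestion, where the Honeywell and GE financial filings are parsed and fed into the system. A data parsing component handles this step and prepares the documents for embedding and storage. Once parsed, the documents are embedded with a model such as OllamaEmbedding, which converts them into vector representations that can be stored and queried efficiently.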
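To make "vector representation" concrete, here is a toy sketch of embedding and similarity. It is not how OllamaEmbedding works: `toy_embed` is a hypothetical bag-of-words hashing function invented for this example, whereas a real embedding model produces dense vectors that capture meaning rather than word overlap.

```python
import math
from typing import List


def toy_embed(text: str, dim: int = 64) -> List[float]:
    """Map text to a unit-length vector by counting words in hashed buckets."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # Deterministic bucket choice (the built-in hash() is salted per process).
        bucket = sum(ord(ch) for ch in word) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    query = toy_embed("net revenue")
    for passage in ("net revenue growth", "zzz"):
        print(passage, round(cosine(query, toy_embed(passage)), 3))
```

Passages that share words with the query score higher, which is the property a vector store relies on when it ranks stored chunks against a query vector.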


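The embedded vectors are then stored in the Qdrant vector store, a database built for vector data and semantic search. Within it, the Honeywell and GE documents are kept in separate collections so that retrieval stays organized and efficient. When the agent processes a query, it interacts with the Qdrant vector store to fetch the relevant vector representations of the documents.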
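The effect of keeping each company in its own collection can be shown with a small in-memory stand-in. `ToyVectorStore` is a hypothetical class written for this sketch and does not use or mirror the Qdrant client API; the two-dimensional vectors are made up.

```python
import math
from typing import Dict, List, Tuple


class ToyVectorStore:
    """In-memory stand-in for a vector database with named collections."""

    def __init__(self) -> None:
        self._collections: Dict[str, List[Tuple[List[float], str]]] = {}

    def add(self, collection: str, vector: List[float], text: str) -> None:
        """Append one (vector, text) point to the named collection."""
        self._collections.setdefault(collection, []).append((vector, text))

    def search(self, collection: str, query: List[float], top_k: int = 3) -> List[str]:
        """Return the texts of the top_k points most similar to the query."""

        def cos(a: List[float], b: List[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(y * y for y in b))
            return dot / (na * nb) if na and nb else 0.0

        points = self._collections.get(collection, [])
        ranked = sorted(points, key=lambda p: cos(p[0], query), reverse=True)
        return [text for _, text in ranked[:top_k]]


if __name__ == "__main__":
    store = ToyVectorStore()
    store.add("honeywell-10k", [1.0, 0.0], "Honeywell aerospace segment")
    store.add("honeywell-10k", [0.0, 1.0], "Honeywell risk factors")
    store.add("ge-10k", [1.0, 0.0], "GE aerospace segment")
    # The same query vector returns different text depending on the collection.
    print(store.search("honeywell-10k", [0.9, 0.1], top_k=1))
    print(store.search("ge-10k", [0.9, 0.1], top_k=1))
```

Because a search only ever looks inside one collection, a question routed to the GE tool cannot pull in Honeywell passages, and vice versa.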


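Implementation


Setting up the environment:

Setting up the agent starts with importing the required classes from llama_index.core for directory reading, vector store indexing, and index loading. The code also imports the modules for embeddings and the LLM (both served by Ollama) and the integration with Qdrant, the vector store.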


from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
    Settings
)
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from llama_index.core.agent import ReActAgent
from llama_index.core.chat_engine.types import AgentChatResponse


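Initializing the multi-document ReAct agent

The MultiDocumentReActAgent class encapsulates everything needed to manage multiple document indexes and query engines. Initialization configures the chunk size, the embedding model, and the LLM in Settings, and sets up Qdrant vector stores for the Honeywell and GE documents.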


class MultiDocumentReActAgent:
    def __init__(self):
        Settings.chunk_size = 512
        Settings.embed_model = OllamaEmbedding(model_name='snowflake-arctic-embed:33m')
        Settings.llm = Ollama(model='mistral:latest')
        _qdrant_client = QdrantClient(url='localhost', port=6333)
        honeywell_vector_store = QdrantVectorStore(client=_qdrant_client, collection_name='honeywell-10k')
        ge_vector_store = QdrantVectorStore(client=_qdrant_client, collection_name='ge-10k')
        self.honeywell_storage_context = StorageContext.from_defaults(vector_store=honeywell_vector_store)
        self.ge_storage_context = StorageContext.from_defaults(vector_store=ge_vector_store)
        self.index_loaded = False
        self.honeywell_index = None
        self.ge_index = None
        self.query_engine_tools = []
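As a rough illustration of what a chunk size controls, the sketch below splits a list of words into fixed-size pieces. It is a simplification under stated assumptions: LlamaIndex's splitter works on tokens and sentence boundaries rather than whitespace-separated words, and the `chunk_words` function and its `overlap` parameter are hypothetical names introduced here.

```python
from typing import List


def chunk_words(words: List[str], chunk_size: int, overlap: int = 0) -> List[List[str]]:
    """Split words into chunks of at most chunk_size, sharing `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this chunk already reached the end of the input
    return chunks


if __name__ == "__main__":
    sample = "revenue grew while operating margin held steady across segments".split()
    for chunk in chunk_words(sample, chunk_size=4, overlap=1):
        print(chunk)
```

Smaller chunks give the retriever more precise passages to match against, at the cost of less surrounding context per passage; the overlap keeps a sentence that straddles a boundary from being lost entirely.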


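Loading and building the indexes

The agent first tries to load existing indexes from storage. If they are not available, it reads the Honeywell and GE financial filings, builds the indexes, and stores them in the vector store.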


    def load_from_existing_context(self):
        try:
            self.honeywell_index = load_index_from_storage(storage_context=self.honeywell_storage_context)
            self.ge_index = load_index_from_storage(storage_context=self.ge_storage_context)
            self.index_loaded = True
        except Exception as e:
            self.index_loaded = False
        if not self.index_loaded:
            ge_docs = SimpleDirectoryReader(input_files=["./data/10k/ge_2023.pdf"]).load_data()
            honeywell_docs = SimpleDirectoryReader(input_files=["./data/10k/honeywell_2023.pdf"]).load_data()
            self.ge_index = VectorStoreIndex.from_documents(documents=ge_docs, storage_context=self.ge_storage_context)
            self.honeywell_index = VectorStoreIndex.from_documents(documents=honeywell_docs,
                                                                   storage_context=self.honeywell_storage_context)
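The load-or-build fallback can be factored into a small helper that handles each index separately and logs why a load failed instead of discarding the exception. This is a sketch with hypothetical names (`load_or_build` and its `load` and `build` callables); it does not call LlamaIndex. Whether `load_index_from_storage` can find an index when only an external vector store is configured may depend on the LlamaIndex version, and reattaching with `VectorStoreIndex.from_vector_store` is an alternative worth checking, though it is not what the article's code does.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("index_loader")


def load_or_build(name: str, load: Callable[[], T], build: Callable[[], T]) -> T:
    """Return load() if it succeeds; otherwise log the reason and return build()."""
    try:
        index = load()
        logger.info("Loaded existing index %r", name)
        return index
    except Exception as exc:  # broad on purpose: any load failure triggers a rebuild
        logger.warning("Could not load index %r (%s); rebuilding", name, exc)
        return build()


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    def missing_index() -> str:
        raise FileNotFoundError("no persisted index")

    print(load_or_build("ge-10k", missing_index, lambda: "freshly built index"))
    print(load_or_build("honeywell-10k", lambda: "loaded index", lambda: "unused"))
```

Handling the two indexes independently means that one successful load is not thrown away when the other fails, which is what happens when both loads share a single try block.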


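Creating the query engines and tools

The next step creates a query engine for each index and wraps it in a tool whose metadata describes the document it covers. The agent relies on these names and descriptions to decide which tool to call for a given financial question.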


    def create_query_engine_and_tools(self):
        ge_engine = self.ge_index.as_query_engine(similarity_top_k=3)
        honeywell_engine = self.honeywell_index.as_query_engine(similarity_top_k=3)
        self.query_engine_tools = [
            QueryEngineTool(
                query_engine=ge_engine,
                metadata=ToolMetadata(
                    name="ge_10k",
                    description=(
                        "Provides information about GE financials for year 2023. "
                        "Use a detailed plain text question as input to the tool."
                    ),
                ),
            ),
            QueryEngineTool(
                query_engine=honeywell_engine,
                metadata=ToolMetadata(
                    name="honeywell_10k",
                    description=(
                        "Provides information about Honeywell financials for year 2023. "
                        "Use a detailed plain text question as input to the tool."
                    ),
                ),
            ),
        ]
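Tool names and descriptions matter because they end up as text the LLM reads when choosing an action. The sketch below shows the idea only: `ToolInfo` and `render_tool_menu` are hypothetical, and the layout is invented for illustration rather than copied from LlamaIndex's prompt template.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ToolInfo:
    """Hypothetical mirror of a tool's name and description pair."""

    name: str
    description: str


def render_tool_menu(tools: List[ToolInfo]) -> str:
    """Format the tools as one line each, ready to embed in a prompt."""
    return "\n".join(f"> {tool.name}: {tool.description}" for tool in tools)


if __name__ == "__main__":
    menu = render_tool_menu(
        [
            ToolInfo("ge_10k", "Provides information about GE financials for year 2023."),
            ToolInfo("honeywell_10k", "Provides information about Honeywell financials for year 2023."),
        ]
    )
    print(menu)
```

A description that names the company and the filing year gives the model enough to route a question to the right tool; a vague description leaves it guessing.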


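Creating and configuring the ReAct agent

The create_agent method gives the ReAct agent a detailed context. The context has the agent answer in the persona of a wise, seasoned investor offering insight into the companies' finances.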


    def create_agent(self):
        context = """You are a sage investor who possesses unparalleled expertise on the companies Honeywell and GE. As an ancient and wise investor who has navigated the complexities of the stock market for centuries, you possess deep, arcane knowledge of these two companies, their histories, market behaviors, and future potential. You will answer questions about Honeywell and GE in the persona of a sagacious and veteran stock market investor.
        Your wisdom spans across the technological innovations and industrial prowess of Honeywell, as well as the digital transformation and enterprise information management expertise of GE. You understand the strategic moves, financial health, and market positioning of both companies. Whether discussing quarterly earnings, product launches, mergers, acquisitions, or market trends, your insights are both profound and insightful.
        When engaging with inquisitors, you weave your responses with ancient wisdom and modern financial acumen, providing guidance that is both enlightening and practical. Your responses are steeped in the lore of the markets, drawing parallels to historical events and mystical phenomena, all while delivering precise, actionable advice. 
        Through your centuries of observation, you have mastered the art of predicting market trends and understanding the underlying currents that drive stock performance. Your knowledge of Honeywell encompasses its ventures in aerospace, building technologies, performance materials, and safety solutions. Similarly, your understanding of GE covers its leadership in enterprise content management, digital transformation, and information governance.
        As the sage investor, your goal is to guide those who seek knowledge on Honeywell and GE, illuminating the path to wise investments and market success.
        """
        agent = ReActAgent.from_tools(
            self.query_engine_tools,
            llm=Settings.llm,
            verbose=True,
            context=context
        )
        return agent


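Running the agent

Finally, the script includes a loop for querying the agent interactively. The user types a question, and the agent answers with financial insights drawn from the loaded documents and indexes.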


if __name__ == "__main__":
    multi_doc_agents = MultiDocumentReActAgent()
    multi_doc_agents.load_from_existing_context()
    multi_doc_agents.create_query_engine_and_tools()
    _agent = multi_doc_agents.create_agent()
    while True:
        input_query = input("Query [type bye or exit to quit]: ")
        if input_query.lower() == "bye" or input_query.lower() == "exit":
            break
        response: AgentChatResponse = _agent.chat(message=input_query)
        print(str(response))
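The interactive loop is easier to test when the quit check and the query handling are separated from `input()`. The following sketch assumes a hypothetical `answer` callable in place of the agent's chat method; `is_quit` and `run_session` are names introduced for this example.

```python
from typing import Callable, Iterable, List

QUIT_WORDS = {"bye", "exit"}


def is_quit(text: str) -> bool:
    """True when the input, ignoring case and surrounding spaces, is a quit word."""
    return text.strip().lower() in QUIT_WORDS


def run_session(queries: Iterable[str], answer: Callable[[str], str]) -> List[str]:
    """Answer each query in order, skipping blanks and stopping at a quit word."""
    replies: List[str] = []
    for query in queries:
        if is_quit(query):
            break
        if not query.strip():
            continue  # do not send empty input to the agent
        replies.append(answer(query))
    return replies


if __name__ == "__main__":
    scripted = ["What was 2023 revenue?", "   ", "EXIT ", "never reached"]
    print(run_session(scripted, answer=lambda q: f"(agent reply to: {q})"))
```

Unlike the loop in the script, this version also tolerates trailing spaces after `exit` and ignores empty lines.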


Putting it all together, the complete code for the multi-document agentic RAG is shown below.


from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
    Settings
)
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from llama_index.core.agent import ReActAgent
from llama_index.core.chat_engine.types import AgentChatResponse

class MultiDocumentReActAgent:
    def __init__(self):
        Settings.chunk_size = 512
        Settings.embed_model = OllamaEmbedding(model_name='snowflake-arctic-embed:33m')
        Settings.llm = Ollama(model='mistral:latest')
        _qdrant_client = QdrantClient(url='localhost', port=6333)
        honeywell_vector_store = QdrantVectorStore(client=_qdrant_client, collection_name='honeywell-10k')
        ge_vector_store = QdrantVectorStore(client=_qdrant_client, collection_name='ge-10k')
        self.honeywell_storage_context = StorageContext.from_defaults(vector_store=honeywell_vector_store)
        self.ge_storage_context = StorageContext.from_defaults(vector_store=ge_vector_store)
        self.index_loaded = False
        self.honeywell_index = None
        self.ge_index = None
        self.query_engine_tools = []
    def load_from_existing_context(self):
        try:
            self.honeywell_index = load_index_from_storage(storage_context=self.honeywell_storage_context)
            self.ge_index = load_index_from_storage(storage_context=self.ge_storage_context)
            self.index_loaded = True
        except Exception as e:
            self.index_loaded = False
        if not self.index_loaded:
            # load data
            ge_docs = SimpleDirectoryReader(input_files=["./data/10k/ge_2023.pdf"]).load_data()
            honeywell_docs = SimpleDirectoryReader(input_files=["./data/10k/honeywell_2023.pdf"]).load_data()
            # build index
            self.ge_index = VectorStoreIndex.from_documents(documents=ge_docs, storage_context=self.ge_storage_context)
            self.honeywell_index = VectorStoreIndex.from_documents(documents=honeywell_docs,
                                                                   storage_context=self.honeywell_storage_context)
    def create_query_engine_and_tools(self):
        ge_engine = self.ge_index.as_query_engine(similarity_top_k=3)
        honeywell_engine = self.honeywell_index.as_query_engine(similarity_top_k=3)
        self.query_engine_tools = [
            QueryEngineTool(
                query_engine=ge_engine,
                metadata=ToolMetadata(
                    name="ge_10k",
                    description=(
                        "Provides detailed financial information about GE for the year 2023. "
                        "Input a specific plain text financial query for the tool"
                    ),
                ),
            ),
            QueryEngineTool(
                query_engine=honeywell_engine,
                metadata=ToolMetadata(
                    name="honeywell_10k",
                    description=(
                        "Provides detailed financial information about Honeywell for the year 2023. "
                        "Input a specific plain text financial query for the tool"
                    ),
                ),
            ),
        ]
    def create_agent(self):
        # [Optional] Add Context
        context = """You are a sage investor who possesses unparalleled expertise on the companies Honeywell and GE. As an ancient and wise investor who has navigated the complexities of the stock market for centuries, you possess deep, arcane knowledge of these two companies, their histories, market behaviors, and future potential. You will answer questions about Honeywell and GE in the persona of a sagacious and veteran stock market investor.
        Your wisdom spans across the technological innovations and industrial prowess of Honeywell, as well as the digital transformation and enterprise information management expertise of GE. You understand the strategic moves, financial health, and market positioning of both companies. Whether discussing quarterly earnings, product launches, mergers, acquisitions, or market trends, your insights are both profound and insightful.
        When engaging with inquisitors, you weave your responses with ancient wisdom and modern financial acumen, providing guidance that is both enlightening and practical. Your responses are steeped in the lore of the markets, drawing parallels to historical events and mystical phenomena, all while delivering precise, actionable advice. 
        Through your centuries of observation, you have mastered the art of predicting market trends and understanding the underlying currents that drive stock performance. Your knowledge of Honeywell encompasses its ventures in aerospace, building technologies, performance materials, and safety solutions. Similarly, your understanding of GE covers its leadership in enterprise content management, digital transformation, and information governance.
        As the sage investor, your goal is to guide those who seek knowledge on Honeywell and GE, illuminating the path to wise investments and market success.
        """
        agent = ReActAgent.from_tools(
            self.query_engine_tools,
            llm=Settings.llm,
            verbose=True,
            context=context
        )
        return agent

if __name__ == "__main__":
    multi_doc_agents = MultiDocumentReActAgent()
    multi_doc_agents.load_from_existing_context()
    multi_doc_agents.create_query_engine_and_tools()
    _agent = multi_doc_agents.create_agent()
    while True:
        input_query = input("Query [type bye or exit to quit]: ")
        if input_query.lower() == "bye" or input_query.lower() == "exit":
            break
        response: AgentChatResponse = _agent.chat(message=input_query)
        print(str(response))


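Conclusion

The multi-document ReAct agent combines vector embeddings, query engines, and an LLM supplied with context to give detailed answers to complex financial questions about Honeywell and GE. The same setup can be extended and adapted to cover other companies and domains, which makes it useful to investors and analysts.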


Source: https://medium.com/stackademic/building-a-multi-document-react-agent-for-financial-analysis-using-llamaindex-and-qdrant-72a535730ac3