Using LangChain ReAct Agents with Qdrant and Llama3 for Intelligent Information Retrieval

May 28, 2024, posted by alex

Introduction

In natural language processing (NLP) and information retrieval, combining cutting-edge tools and models is essential for efficient query handling and insightful responses. This article explores how three powerful technologies work together to power an intelligent information retrieval system: LangChain's ReAct agents, the Qdrant vector database, and the Llama3 large language model (LLM) served from a Groq endpoint.


Fact: an agent is nothing more than an LLM with a sophisticated prompt.


Setting Up the Environment

These are the packages we will be using.


Technologies used:


  1. PyPDF2 -> extracts text from PDFs
  2. Qdrant -> used as the vector database
  3. LangChain -> for agent creation and text processing
  4. ChatGroq -> provides an API endpoint for the LLM via Groq, for fast inference
  5. Gradio -> user interface for the chat


import os
from PyPDF2 import PdfReader
import numpy as np
from langchain_community.vectorstores import Qdrant
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain.tools.retriever import create_retriever_tool
from langchain.agents import Tool
from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langchain_community.llms import OpenAI
from langchain_groq import ChatGroq
import gradio as gr
from fuzzywuzzy import fuzz
from fuzzywuzzy import process


We need the Groq API for inference. Their beta is currently free, but rate-limited. Link: https://console.groq.com/keys
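
A minimal sketch of loading the key in code, assuming you have exported it as an environment variable (the name GROQ_API_KEY is our own convention here; it is reused later when creating the ChatGroq model):


# Read the Groq API key from the environment; set it beforehand, e.g. export GROQ_API_KEY=...
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
if not GROQ_API_KEY:
    raise ValueError("Please set the GROQ_API_KEY environment variable first.")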


You can use Qdrant locally or in Qdrant Cloud. Its API is also free; see: https://cloud.qdrant.io/login
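
Later in this article we create the collections in memory (location=":memory:"). If you prefer Qdrant Cloud instead, the connection looks roughly like the sketch below; the cluster URL and the QDRANT_API_KEY environment variable are hypothetical placeholders, and docs/embeddings stand for the split documents and embedding model prepared in the following sections:


# Sketch: pointing LangChain's Qdrant wrapper at a Qdrant Cloud cluster
# instead of an in-memory instance. Replace the placeholders with your own values.
qdrant_cloud = Qdrant.from_documents(
    docs,                                   # your split documents
    embeddings,                             # your embedding model
    url="https://YOUR-CLUSTER-URL.qdrant.io",
    api_key=os.getenv("QDRANT_API_KEY"),
    collection_name="my_collection",
)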


Data Preparation

With the environment set up, the next step is data extraction. We start by extracting text from the PDF documents stored in a specified directory, then process and save the extracted text for later use.


def extract_text_from_pdf(pdf_path):
    """
    Extract text content from a PDF file.
    Args:
    - pdf_path (str): The path to the PDF file.
    Returns:
    - str: The extracted text content.
    """
    reader = PdfReader(pdf_path)
    extracted_text = ""
    for page in reader.pages:
        extracted_text += page.extract_text() or ""  # extract_text() may return None for image-only pages
    return extracted_text

def extract_text_from_pdfs_in_directory(directory):
    """
    Extract text content from all PDF files in a directory and save as text files.
    Args:
    - directory (str): The path to the directory containing PDF files.
    """
    for filename in os.listdir(directory):
        if filename.endswith(".pdf"):
            pdf_path = os.path.join(directory, filename)
            extracted_text = extract_text_from_pdf(pdf_path)
            txt_filename = os.path.splitext(filename)[0] + ".txt"
            txt_filepath = os.path.join(directory, txt_filename)
            with open(txt_filepath, "w", encoding="utf-8") as txt_file:
                txt_file.write(extracted_text)
# Specify the directory containing PDF files
directory_path = "Docs/"
# Extract text from PDFs in the directory and save as text files
extract_text_from_pdfs_in_directory(directory_path)


Loading the stored text files:


directory_path = "Docs"
txt_files = [file for file in os.listdir(directory_path) if file.endswith('.txt')]
all_documents = {}
for txt_file in txt_files:
    loader = TextLoader(os.path.join(directory_path, txt_file))
    documents = loader.load()
    # Step 2: Split documents into chunks and add metadata
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100, separator="\n")
    docs = text_splitter.split_documents(documents)
    for doc in docs:
        doc.metadata["source"] = txt_file  # Add source metadata
    all_documents[txt_file] = docs


Storing the Data in the Qdrant Vector Database with Custom Embeddings

The Qdrant vector database offers powerful capabilities for efficiently storing and retrieving high-dimensional vectors. Here we use HuggingFace's "all-mpnet-base-v2" embedding model, with Qdrant as the vector store.


# Step 3: Initialize the text embedding model
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)


We build a separate collection for each document so that the agent can decide, based on the query, which document(s) it needs.


# Step 4: Create Qdrant vector store collections for each document
qdrant_collections = {}
for txt_file in txt_files:
    qdrant_collections[txt_file] = Qdrant.from_documents(
        all_documents[txt_file],
        embeddings,
        location=":memory:", 
        collection_name=txt_file,
    )


For the files above, I now have 4 collections:
Collection: Facebook Terms of Use.txt
Collection: SLSHandbook Contract Law.txt
Collection: Google Terms of Service_en_in.txt
Collection: OPENAI Terms of Use.txt
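
A quick way to reproduce this listing, assuming the qdrant_collections dictionary built in the previous step:


# Print the name of every collection we just created (one per source file).
for collection_name in qdrant_collections:
    print("Collection:", collection_name)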


Now we create a retriever for each collection. These retrievers will be integrated with the LangChain agent to enable advanced information retrieval.


retriever = {}
for txt_file in txt_files:
    retriever[txt_file] = qdrant_collections[txt_file].as_retriever()


Setting Up the ReAct Agent

Here we write a few functions that the agent can choose to call in order to gather more information and reach a well-grounded conclusion.


These are the functions we use:


1. For fetching the relevant document (the retriever, to be precise):


def get_relevant_document(name : str) -> str:
    # String name for fuzzy search
    search_name = name
    # Find the best match using fuzzy search
    best_match = process.extractOne(search_name, txt_files, scorer=fuzz.ratio)
    # Get the selected file name
    selected_file = best_match[0]
    
    selected_retriever = retriever[selected_file]
    global query            # the user's current question, set at module level before each run
    results = selected_retriever.get_relevant_documents(query)
    global retrieved_text   # cached here so get_summarized_text can reuse it
    
    total_content = "\n\nBelow are the related document's content: \n\n"
    chunk_count = 0
    for result in results:
        chunk_count += 1
        if chunk_count > 4:
            break
        total_content += result.page_content + "\n"
    retrieved_text = total_content
    return total_content


2. For summarizing any text:


def get_summarized_text(name : str) -> str:
    from transformers import pipeline
    summarizer = pipeline("summarization", model="Falconsai/text_summarization")
    global retrieved_text
    article = retrieved_text
    return summarizer(article, max_length=1000, min_length=30, do_sample=False)[0]['summary_text']


3. For getting today's date:


def get_today_date(input : str) -> str:
    import datetime
    today = datetime.date.today()
    return f"\n {today} \n"


4. For looking up a person's age information from a database:


def get_age(name: str, person_database: dict) -> int:
    """
    Get the age of a person from the database.
    Args:
    - name (str): The name of the person.
    - person_database (dict): A dictionary containing person information.
    Returns:
    - int: The age of the person if found, otherwise None.
    """
    if name in person_database:
        return person_database[name]["Age"]
    else:
        return None

def get_age_info(name: str) -> str:
    """
    Get age information for a person from a small in-memory database.
    Args:
    - name (str): The name of the person.
    Returns:
    - str: A string containing the person's age information.
    """
    person_database = {
        "Sam": {"Age": 21, "Nationality": "US"},
        "Alice": {"Age": 25, "Nationality": "UK"},
        "Bob": {"Age": 11, "Nationality": "US"}
    }
    age = get_age(name, person_database)
    if age is not None:
        return f"\nAge: {age}\n"
    else:
        return f"\nAge Information for {name} not found.\n"


Here we wrap these functions using Tool (from langchain.agents import Tool) so that the agent can call them.


# Define the Tool
get_age_info_tool = Tool(
    name="Get Age",
    func=get_age_info,
    description="Useful for getting age information for any person. Input should be the name of the person."
)
get_today_date_tool = Tool(
    name="Get Todays Date",
    func=get_today_date,
    description="Useful for getting today's date"
)
get_relevant_document_tool = Tool(
    name="Get Relevant document",
    func=get_relevant_document,
    description="Useful for getting relevant document that we need."
)
get_summarized_text_tool = Tool(
    name="Get Summarized Text",
    func=get_summarized_text,
    description="Useful for getting summarized text for any document."
)
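
As a quick sanity check (our own addition, not part of the original flow), each wrapped tool can also be run directly before handing it to the agent:


# Call the tools by hand to verify they behave as expected.
print(get_age_info_tool.run("Bob"))     # expected output contains "Age: 11"
print(get_today_date_tool.run(""))      # expected output contains today's date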


Setting Up the Agent Prompt

Here we use the well-known ReAct agent prompt from the LangChain Hub:


from langchain import hub
prompt_react = hub.pull("hwchase17/react")
print(prompt_react.template)


The output of this prompt template:


Answer the following questions as best you can. You have access to the following tools:
{tools}
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!
Question: {input}
Thought:{agent_scratchpad}


Creating the ReAct Agent

LangChain's ReAct agent helps orchestrate the entire query-processing flow. By using such agents, we can break complex queries down into manageable steps and execute them systematically. Below we create and configure a ReAct agent to enable intelligent query handling.


tools = [get_relevant_document_tool, get_summarized_text_tool, get_today_date_tool, get_age_info_tool]
retrieved_text = ""
# Load ReAct prompt
prompt_react = hub.pull("hwchase17/react")
# Initialize ChatGroq model for language understanding
model = ChatGroq(model_name="llama3-70b-8192", groq_api_key=GROQ_API_KEY, temperature=0)
# Create ReAct agent
react_agent = create_react_agent(model, tools=tools, prompt=prompt_react)
react_agent_executor = AgentExecutor(
    agent=react_agent, tools=tools, verbose=True, handle_parsing_errors=True
)


Executing Complex Queries

With the environment set up and the ReAct agent configured, we can now run complex queries. Let's look at a few examples:


1. query = "Give me a summary for the question: What is the age requirement for using OpenAI Services, and what are the provisions if a user is under 18?"


react_agent_executor.invoke({"input": query})


Output:


> Entering new AgentExecutor chain...
Thought: I need to find a relevant document that mentions the age requirement for using OpenAI Services.
Action: Get Relevant document
Action Input: OpenAI Terms of Service
Below are the related document's content: 
[Document Content]
Thought: I have the relevant document, now I need to summarize the part that mentions the age requirement.
Action: Get Summarized Text
Action Input: The relevant document content
[Summarized Content]
Final Answer: The age requirement for using OpenAI Services is at least 13 years old or the minimum age required in your country to consent to use the Services. If you are under 18, you must have your parent or legal guardian's permission to use the Services.
> Finished chain.


2. query = "Give me a summary of what resources Google offers to users for assistance and guidance in using its services?"


react_agent_executor.invoke({"input": query})


Output:


> Entering new AgentExecutor chain...
Thought: I need to find a relevant document that provides information on the resources Google offers to users for assistance and guidance in using its services.
Action: Get Relevant document
Action Input: Google user resources
Below are the related document's content: 
[Document Content]
Thought: I need to summarize the content of the document to get a summary of what resources Google offers to users for assistance and guidance in using its services.
Action: Get Summarized Text
Action Input: The document content
[Summarized Content]
Final Answer: Google offers various resources to users for assistance and guidance in using its services, including terms of service, privacy policy, security checkup, and service-related communications. These resources help establish what users can expect from Google and what Google expects from users, including rules for using services, intellectual property rights, and content management. Additionally, Google provides resources for organizations using its services, such as assigning Google Accounts and requiring additional rules.
> Finished chain.


3. query = "I am Bob. Will I satisfy the age requirement for using OpenAI Services, as per the OpenAI Terms, in 2027?"


react_agent_executor.invoke({"input": query})


Output:


> Entering new AgentExecutor chain...
Thought: I need to know the age requirement specified in the OpenAI Terms and my current age to determine if I'll be eligible in 2027.
Action: Get Relevant document
Action Input: OpenAI Terms
Below are the related document's content: 
[Document Content]
Thought: I have the OpenAI Terms document, now I need to know my current age to determine if I'll be eligible in 2027.
Action: Get Age
Action Input: Bob
Age: 11
Thought: I now know my current age, which is 11. I need to determine if I'll be eligible in 2027 for the age requirement specified in the OpenAI Terms.
Action: Get Todays Date
Action Input: None
 2024-05-18 
Thought: I have my current age and today's date. I need to calculate my age in 2027 to determine if I'll be eligible for using the OpenAI Services.
Action: Calculate age in 2027
..... And so on as per tools available


Conclusion

In conclusion, integrating LangChain's ReAct agents with Qdrant and Llama3 opens up exciting possibilities for intelligent information retrieval systems. By leveraging advanced NLP techniques and tools, we can handle queries more efficiently and deliver insightful responses to user questions.


With these steps, you can build your own intelligent information retrieval system using LangChain, Qdrant, and Llama3.


Source: https://medium.com/@yash9439/using-langchain-react-agents-with-qdrant-and-llama3-for-intelligent-information-retrieval-b181ce7a5962