Build with GenAI: Generative Search with Local LLM

Published April 30, 2024 by alex

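What is generative search?

On average, we search the internet 3-4 times a day. But for each search you may have to try several search terms, scan multiple results pages, and dig the answer out of ad-filled pages yourself. Many people also lack the skills to craft a good search term (for example, many don't know how to do an exact match with quotes, exclude keywords with a minus sign, or restrict a search to a specific site).

With an LLM, the model can run the search for you and give you a concise answer based on the results. Many GPT-based generative search apps emerged last year, such as Phind and Perplexity. Even Google released an experimental generative search feature, which could cannibalize its own traditional search and ads business.

Beyond general web search, you can apply generative search to specific content or sites, such as a company's FAQ content, an R&D team's documentation, or even your personal notes. All of these are generative search applications that can give better answers to your queries without you having to click through a list of results.

Understanding the architecture

Given how useful generative search is, and that unicorn startups are built around it, you might think it is hard to build. In fact, once we break it down into its basic components, the core concept is quite simple: for a minimal version, all you need is a search function and an LLM.

Here is the flow from query to answer: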

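Query input: You enter a query, just as in traditional search. But you can use full sentences instead of keywords, because the LLM can understand your intent.
Query rewriting: The LLM tries to understand your query and rewrites it into a suitable search term.
Online search: The search function looks up online content related to the search term.
(Optional) Retry and refine: The LLM may decide to try another search term to improve the results.
Process and answer: The LLM reads the content returned by the search engine and generates an answer based on it.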
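
The flow above can be sketched as a single function that receives its components as arguments. This is a minimal illustration, not the implementation built later in this post: `generative_search` and the four `fake_*` stand-ins are hypothetical names, the stand-ins return canned values so the control flow runs without a model or network access, and the optional retry step is left out.

```python
from typing import Callable, List


def generative_search(
    query: str,
    rewrite_query: Callable[[str], str],
    search: Callable[[str], List[str]],
    read: Callable[[str], str],
    answer: Callable[[str, str], str],
) -> str:
    """Run the query -> search term -> results -> content -> answer flow."""
    search_term = rewrite_query(query)   # query rewriting
    urls = search(search_term)           # online search
    if not urls:
        return "No results found."
    content = read(urls[0])              # fetch the first result
    return answer(query, content)        # process and answer


# Stand-in components (hypothetical) so the flow can be run offline.
def fake_rewrite(query: str) -> str:
    return query.lower().rstrip("?")


def fake_search(term: str) -> List[str]:
    return [f"https://example.com/search?q={term.replace(' ', '+')}"]


def fake_read(url: str) -> str:
    return f"Page content fetched from {url}"


def fake_answer(query: str, content: str) -> str:
    return f"Answer to '{query}' based on: {content}"


result = generative_search(
    "What is generative search?", fake_rewrite, fake_search, fake_read, fake_answer
)
print(result)
```

Passing the components in as functions keeps the flow itself independent of which LLM or search API sits behind each step.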

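Of course, there are many ways to optimize generative search and improve the output. For example, you can crawl the search results, split them into chunks, and store them as vectors for retrieval (the typical process of retrieval-augmented generation, or RAG), and you can optimize how content is fed to the LLM.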
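
To make the chunking step concrete, here is a minimal sketch of fixed-size character chunking with overlap. The function name `chunk_text` and its default sizes are assumptions for illustration; a real RAG pipeline would go on to embed each chunk and store the vectors, which is omitted here.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that overlap slightly."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.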

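Today, however, we will build only a minimal version of generative search to help you understand the core flow. This lays a solid foundation for exploring more advanced techniques and optimizations later.

Building generative search step by step

Setting up a local LLM

Let's start with the LLM part of generative search. Instead of using OpenAI's GPT API, I want to show you how easy it is to use a local LLM. You can run an LLM on your own computer for free (especially useful if you need it to summarize very long texts). If you care about privacy, a local LLM also ensures that you don't send sensitive information to OpenAI. Another advantage is that if you are searching for content that GPT censors, you can use an uncensored local LLM.

My local LLM of choice is OpenHermes-2.5-Mistral-7B-16K. I use its quantized GGUF version, which needs less compute and fits in the VRAM of a consumer graphics card, or in system RAM if you run it on the CPU.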
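
A rough back-of-the-envelope calculation shows why quantization matters for fitting the model in memory. The figures below are round-number assumptions, not measurements: 7 billion parameters, 16 bits per weight for half precision, and an average of about 4.5 bits per weight for a 4-bit quantization. The context cache and runtime overhead come on top of the weights.

```python
def estimate_weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


fp16_gb = estimate_weights_gb(7e9, 16)   # unquantized half precision
q4_gb = estimate_weights_gb(7e9, 4.5)    # assumed average for a 4-bit quant
print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB")
```

Under these assumptions the quantized weights take roughly a quarter of the memory of the half-precision ones.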

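Let's install llama-cpp-python (a tool for running LLMs and using them easily from Python) and download the model with aria2 (you can, of course, also download the model manually into a local directory).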

# These are notebook cells (e.g. Google Colab): "!" runs a shell command and "%cd" changes directory
# Install llama-cpp-python with CUDA support
# Note: newer llama-cpp-python releases use -DGGML_CUDA=on in place of -DLLAMA_CUBLAS=on
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
%cd /content
!apt-get update -qq && apt-get install -y -qq aria2
# Download a local large language model. I'm using OpenHermes-2.5-Mistral-7B-16K-GGUF, which has a longer context size and pretty good quality for its size
# If you want to use other local models that can easily run on consumer hardware, check out this repo: https://github.com/Troyanovsky/Local-LLM-Comparison-Colab-UI/
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M "https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-16k-GGUF/resolve/main/openhermes-2.5-mistral-7b-16k.Q4_K_M.gguf?download=true" -d /content/model/ -o openhermes-2.5-mistral-7b-16k.Q4_K_M.gguf

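With llama-cpp-python installed and the model downloaded, we need to define a few functions to load the model and call it with a prompt.

To load the model, just use the utility in llama_cpp. You can set whether to run the model on the GPU (much faster than on the CPU, if you have one) and the context size: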

# Setting up a local LLM for summarization or chat
from llama_cpp import Llama
def load_llama():
    llm = Llama(
            model_path="/content/model/openhermes-2.5-mistral-7b-16k.Q4_K_M.gguf", # If you're using another model, change the name
            chat_format="chatml", # Use the chat_format that matches the model
            n_gpu_layers=-1, # Use -1 for all layers on GPU
            n_ctx=12288 # Set context size
    )
    return llm

Then you can define a function that calls the local model just as you would call OpenAI's GPT endpoint. llama-cpp-python wraps local LLM usage nicely, so moving from GPT to a local LLM is a breeze:

def call_llama(prompt, llm):
    """Send a single user prompt to the local model and return the reply text."""
    output = llm.create_chat_completion(
        messages=[
            {
                "role": "system",
                "content": "You're a helpful assistant.",
            }, # Feel free to modify the prompt to suit your own formatting needs
            {"role": "user", "content": prompt},
        ],
        temperature=0.7,
    )
    output_text = output['choices'][0]['message']['content']
    return output_text

That's it for the local LLM part! With this simple llama-cpp setup, you can use almost any local LLM in GGUF format that you find on Hugging Face.
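
To see roughly what the `chat_format="chatml"` setting is for, the sketch below renders a list of chat messages in the ChatML layout that this model family expects. It illustrates the prompt format only and is not llama-cpp-python's own code; `to_chatml` is a hypothetical helper.

```python
from typing import Dict, List


def to_chatml(messages: List[Dict[str, str]]) -> str:
    """Render chat messages as a ChatML prompt ending with an open assistant turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")  # the model continues from here
    return "".join(parts)


prompt = to_chatml([
    {"role": "system", "content": "You're a helpful assistant."},
    {"role": "user", "content": "What is generative search?"},
])
print(prompt)
```

If you switch to a model trained on a different template, the chat_format value has to change with it, otherwise the model sees turn markers it was never trained on.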

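Defining the search function

Next, let's move on to the search part of this project. Here we will use a search API to get a list of URLs from the internet for a given search term, and another function to convert a URL into Markdown that the local LLM can easily understand.

First, we import the necessary libraries: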

import requests
import subprocess
import json
import time

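Then let's define a function that calls a search API. You can use any search API you like, whether Google, Bing, Brave, or something else. Here I use the API from Serper (https://serper.dev/), a Google Search API with a generous free tier (2,500 free queries, no credit card required). You just need to sign up on their website and get an API key from your account.

Here we define a get_search_results function that takes a search term as input and returns a list of search results from the Google Search API. To guard against possible failures, we add retry logic to the API call.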

def get_search_results(search_term, max_retries=2, retry_delay=2):
    url = "https://google.serper.dev/search"
    payload = json.dumps({"q": search_term})
    headers = {
        'X-API-KEY': '<your_api_key>', # Replace with your own API Key
        'Content-Type': 'application/json'
    }
    retries = 0
    while retries < max_retries:
        try:
            response = requests.request("POST", url, headers=headers, data=payload)
            response.raise_for_status()  # Raise an exception for non-2xx status codes
            data = response.json()
            organic_results = data.get("organic", [])
            search_results = []
            search_results_str = ""
            index = 0
            for result in organic_results:
                title = result.get("title", "")
                link = result.get("link", "")
                snippet = result.get("snippet", "")
                search_results.append({"title": title, "link": link, "snippet": snippet})
                formatted_result = f"index: {index}\ntitle: {title}\nlink: {link}\nsnippet: {snippet}\n\n"
                search_results_str += formatted_result
                index += 1
            return search_results, search_results_str
        except requests.exceptions.RequestException as e:
            retries += 1
            print(f"Error: {e}. Retrying in {retry_delay} seconds... (Attempt {retries}/{max_retries})")
            time.sleep(retry_delay)
    raise Exception("Maximum retries exceeded. Failed to retrieve search results.")
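
To check the formatting logic without spending API queries, the parsing part of the function above can be pulled into its own helper and run on a hand-written payload. This is a sketch: `format_organic_results` is a hypothetical name, and the sample dictionary is made up to mimic only the fields the function reads ("organic", "title", "link", "snippet").

```python
from typing import Dict, List, Tuple


def format_organic_results(data: Dict) -> Tuple[List[Dict[str, str]], str]:
    """Extract title, link and snippet from each organic result and number them."""
    results, lines = [], []
    for index, item in enumerate(data.get("organic", [])):
        entry = {key: item.get(key, "") for key in ("title", "link", "snippet")}
        results.append(entry)
        lines.append(
            f"index: {index}\ntitle: {entry['title']}\n"
            f"link: {entry['link']}\nsnippet: {entry['snippet']}\n\n"
        )
    return results, "".join(lines)


# Made-up payload that mimics the fields the function reads.
sample = {
    "organic": [
        {"title": "First page", "link": "https://example.com/a", "snippet": "About A."},
        {"title": "Second page", "link": "https://example.com/b"},  # no snippet
    ]
}
results, results_str = format_organic_results(sample)
print(results_str)
```

The numbered string is what the LLM sees later when it picks a URL, so the indices here must match positions in the returned list.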

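Let's define another function, fetch_url_content, which takes a URL as input and fetches the content of the web page. We can use Jina AI's Reader tool (https://github.com/jina-ai/reader/) to convert a URL into an LLM-friendly format, simply by prepending "https://r.jina.ai/" to the URL you want to crawl.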

def fetch_url_content(url):
    # Prepend "https://r.jina.ai/" to the input URL
    # This converts the URL into LLM-friendly format. Check out their GitHub: https://github.com/jina-ai/reader
    prefixed_url = f"https://r.jina.ai/{url}"

    try:
        curl_cmd = [
            "curl",
            "-H",
            "Accept: text/event-stream",
            prefixed_url,
        ]
        curl_process = subprocess.Popen(curl_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout, stderr = curl_process.communicate()
        if curl_process.returncode == 0:
            content = stdout.decode("utf-8")
            content_lines = [line for line in content.split("\n") if line.startswith("data: ")]
            if content_lines:
                content_data = "\n".join(line[6:] for line in content_lines)
                try:
                    content_value = json.loads(content_data)["content"]
                    return content_value
                except (ValueError, KeyError):
                    pass
            return ""
        else:
            error_message = stderr.decode("utf-8")
            raise Exception(f"cURL request failed: {error_message}")
    except Exception as e:
        raise Exception(f"An error occurred: {e}")

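Choosing which URL to crawl

Instead of crawling every URL the search function returns, we can have the LLM pick the most relevant URL from the results list based on each result's snippet and the user's query, just as we do when we search on Google ourselves.

The pick_url function takes the user's query and the search results from the search function and asks the LLM to pick the index of the most relevant URL. To guard against potential failures (such as the LLM replying with a non-integer index or irrelevant content), we also add some error handling and retry logic here.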

def pick_url(query, search_results_str, search_results, llm, max_retries=2):
    prompt = f"Given the following question, which of the following URLs is most likely to contain the answer for it? Reply ONLY the index number. Question: ```{query}``` List: ```{search_results_str}```"
    # Ask once, then retry up to max_retries times if the reply is unusable
    for _ in range(max_retries + 1):
        reply = call_llama(prompt, llm)
        try:
            index = int(reply.strip())
        except ValueError:
            continue  # Reply was not a bare integer, so ask again
        if 0 <= index < len(search_results):
            return index  # Valid position in the search results list
    raise Exception("Failed to get a valid URL index from the LLM after multiple retries.")

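Putting it all together

Finally, let's define the main function that glues all the components together. The search_with_ai function does the following:

Receive the user's query
Use the LLM to derive a search term from the query
Use the search function to get a list of URLs
Have the LLM pick the most relevant URL
Crawl the most relevant URL and hand the content to the LLM
Generate the final answer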

def search_with_ai(user_input):
    llm = load_llama()
    search_term_prompt = f"Based on the following question, please come up with a search term to use in the search engine. Reply the search term only. Question: ```{user_input}```"
    search_term = call_llama(search_term_prompt, llm)
    print(f"Searching: {search_term}")
    # Search with search API
    search_results, search_results_str = get_search_results(search_term)
    # Pick the most relevant URL
    try:
        top_url_index = pick_url(user_input, search_results_str, search_results, llm)
    except Exception as e:
        print(f"Error picking URL: {e}")
        return
    # Fetch the content from the top URL
    try:
        top_url = search_results[top_url_index]["link"]
        top_snippet = search_results[top_url_index]["snippet"]
        print(f"Crawling: {top_url}")
        content = fetch_url_content(top_url)
    except Exception as e:
        print(f"Error fetching URL content: {e}")
        del llm
        return
    # Truncate the content if it's longer than 36864 characters. I'm using a very lazy estimate here. You can count actual tokens instead.
    if len(content) > 36864:
        content = content[:36864]
    # Call LLM with the content and get the answer
    answer_prompt = f"Answer the question from the given content. Question: ```{user_input}```\n\nContent:```From URL: {top_url} Snippet: {top_snippet}\n{content}```"
    try:
        answer = call_llama(answer_prompt, llm)
        return answer
    except Exception as e:
        print(f"Error calling LLM: {e}")
        return
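
The 36,864-character cutoff works out to 3 characters per token for the 12,288-token context set in load_llama. To count actual tokens instead, as the code comment suggests, you can use the model's own tokenizer. The sketch below assumes the object passed as `llm` exposes `tokenize` (bytes in, token list out) and `detokenize` (token list in, bytes out), as llama-cpp-python's Llama class does; `truncate_to_tokens` and the whitespace stand-in used for the offline demo are hypothetical.

```python
def truncate_to_tokens(text: str, llm, max_tokens: int) -> str:
    """Cut `text` so that it tokenizes to at most `max_tokens` tokens."""
    tokens = llm.tokenize(text.encode("utf-8"), add_bos=False)
    if len(tokens) <= max_tokens:
        return text
    clipped = llm.detokenize(tokens[:max_tokens])
    # Cutting at a token boundary can split a multi-byte character, so drop
    # any incomplete trailing bytes instead of raising.
    return clipped.decode("utf-8", errors="ignore")


class WhitespaceTokenizer:
    """Stand-in (hypothetical) with the same two methods, one token per word."""

    def tokenize(self, text: bytes, add_bos: bool = False):
        return text.split()

    def detokenize(self, tokens):
        return b" ".join(tokens)


demo = truncate_to_tokens("one two three four five", WhitespaceTokenizer(), 3)
print(demo)
```

Whatever budget you choose should leave room for the question, the prompt wording, and the generated answer, since all of them share the same context window.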

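You have now built your own generative AI search with a local LLM! Just call the search_with_ai function and start asking it questions!

Below is an example result.

[Figure: example result]

Conclusion

In this post, we explored the basics of building your own generative search tool with a local LLM. I hope that breaking the process into smaller components and walking through it step by step demystifies how generative search works and gives you a foundation for your next generative AI product idea.

Source: https://medium.com/design-bootcamp/build-with-genai-generative-search-with-local-llm-342eb5a5037a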
