【指南】RAG应用程序中文本嵌入的局限性

2024年09月27日由 alex 发表 396 0

每个人都喜欢文本嵌入模型，这是有道理的：它们擅长对非结构化文本进行编码，从而更容易发现语义相似的内容。它们是大多数 RAG 应用程序的支柱，这一点也不奇怪，尤其是在当前强调从文档和其他文本资源中编码和检索相关信息的情况下。然而，在一些明显的例子中，人们可能会问到一些问题，在这些问题中，RAG 应用程序的文本嵌入方法会出现不足，并提供错误的信息。

如前所述，文本嵌入非常适合编码非结构化文本。另一方面，它们在处理结构化信息和操作（如过滤、排序或聚合）方面并不擅长。试想一个简单的问题：

2024 年上映的评分最高的电影是哪部？

要回答这个问题，我们必须先按上映年份进行筛选，然后再按评分排序。我们将考察使用文本嵌入的简单方法的性能，然后演示如何处理此类问题。这篇文章展示了在处理过滤、排序或聚合等结构化数据操作时，需要使用与文本嵌入不同的工具。

环境设置

在本文中，我们将使用 Neo4j Sandbox 中的推荐项目。推荐项目使用的是 MovieLens 数据集，其中包含电影、演员、评分等信息。

以下代码将实例化一个 LangChain 封装器，以连接 Neo4j 数据库：

os.environ["NEO4J_URI"] = "bolt://44.204.178.84:7687""NEO4J_URI"] = "bolt://44.204.178.84:7687"
os.environ["NEO4J_USERNAME"] = "neo4j"
os.environ["NEO4J_PASSWORD"] = "minimums-triangle-saving"
graph = Neo4jGraph(refresh_schema=False)

此外，你还需要一个 OpenAI API 密钥，请在以下代码中输入该密钥：

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")"OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

数据库包含 10,000 部电影，但尚未存储文本嵌入。为了避免计算所有电影的嵌入，我们将用一个名为 “目标 ”的二级标签来标记这 1000 部评分最高的电影：

graph.query(""""""
MATCH (m:Movie)
WHERE m.imdbRating IS NOT NULL
WITH m
ORDER BY m.imdbRating DESC
LIMIT 1000
SET m:Target
""")

计算和存储文本嵌入

决定嵌入哪些内容是一个重要的考虑因素。由于我们将演示按年份筛选和按评分排序，因此将这些细节排除在嵌入文本之外是不公平的。这就是为什么我选择捕捉每部电影的发行年份、评级、标题和描述。

下面是我们要嵌入的《华尔街之狼》的文本示例：

plot: Based on the true story of Jordan Belfort, from his rise to a wealthy 
      stock-broker living the high life to his fall involving crime, corruption
      and the federal government.
title: Wolf of Wall Street, The
year: 2013
imdbRating: 8.2

你可能会说这不是嵌入结构化数据的好方法，我不会反驳，因为我不知道最好的方法是什么。也许我们应该把它们转换成文本或其他形式，而不是键值项。如果你有更好的想法，请告诉我。

LangChain 中的 Neo4j 向量对象有一个方便的方法 from_existing_graph，你可以在其中选择哪些文本属性应该被编码：

embedding = OpenAIEmbeddings(model="text-embedding-3-small")"text-embedding-3-small")
neo4j_vector = Neo4jVector.from_existing_graph(
    embedding=embedding,
    index_name="movies",
    node_label="Target",
    text_node_properties=["plot", "title", "year", "imdbRating"],
    embedding_node_property="embedding",
)

在本例中，我们使用 OpenAI 的 text-embedding-3-small 模型来生成嵌入。我们使用 from_existing_graph 方法初始化 Neo4jVector 对象。node_label 参数筛选要编码的节点，特别是那些标有 Target 的节点。text_node_properties 参数定义了要嵌入的节点属性，包括情节、标题、年份和 imdbRating。最后，embedding_node_property 定义了存储生成的嵌入结果的属性，指定为 embedding。

简单方法

让我们先尝试根据情节或描述查找一部电影：

pretty_print(
    neo4j_vector.similarity_search(
        "What is a movie where a little boy meets his hero?""What is a movie where a little boy meets his hero?"
    )
)

结果：

plot: A young boy befriends a giant robot from outer space that a paranoid government agent wants to destroy.
title: Iron Giant, The
year: 1999
imdbRating: 8.0
plot: After the death of a friend, a writer recounts a boyhood journey to find the body of a missing boy.
title: Stand by Me
year: 1986
imdbRating: 8.1
plot: A young, naive boy sets out alone on the road to find his wayward mother. Soon he finds an unlikely protector in a crotchety man and the two have a series of unexpected adventures along the way.
title: Kikujiro (Kikujirô no natsu)
year: 1999
imdbRating: 7.9
plot: While home sick in bed, a young boy's grandfather reads him a story called The Princess Bride.
title: Princess Bride, The
year: 1987
imdbRating: 8.1

总的看来，结果还不错。虽然我不确定小男孩是否总能遇到他心目中的英雄，但他一直都在参与其中。不过，这个数据集只包含 1,000 部电影，所以可选项有限。

现在，让我们尝试一下需要进行一些基本过滤的查询：

pretty_print(
    neo4j_vector.similarity_search(
        "Which movies are from year 2016?"
    )
)

结果：

plot: Six short stories that explore the extremities of human behavior involving people in distress.
title: Wild Tales
year: 2014
imdbRating: 8.1
plot: A young man who survives a disaster at sea is hurtled into an epic journey of adventure and discovery. While cast away, he forms an unexpected connection with another survivor: a fearsome Bengal tiger.
title: Life of Pi
year: 2012
imdbRating: 8.0
plot: Based on the true story of Jordan Belfort, from his rise to a wealthy stock-broker living the high life to his fall involving crime, corruption and the federal government.
title: Wolf of Wall Street, The
year: 2013
imdbRating: 8.2
plot: After young Riley is uprooted from her Midwest life and moved to San Francisco, her emotions - Joy, Fear, Anger, Disgust and Sadness - conflict on how best to navigate a new city, house, and school.
title: Inside Out
year: 2015
imdbRating: 8.3

有趣的是，没有一部 2016 年的电影入选。也许我们可以通过不同的文本编码准备获得更好的结果。不过，文本嵌入在这里并不适用，因为我们要处理的是一个简单的结构化数据操作，我们需要根据元数据属性过滤文档，或者在本例中过滤电影。元数据过滤是一种成熟的技术，经常用于提高 RAG 系统的准确性。

下一个查询需要进行排序：

pretty_print(
    neo4j_vector.similarity_search("Which movie has the highest imdb score?")"Which movie has the highest imdb score?")
)

结果：

plot: A silent film production company and cast make a difficult transition to sound.
title: Singin' in the Rain
year: 1952
imdbRating: 8.3
plot: A film about the greatest pre-Woodstock rock music festival.
title: Monterey Pop
year: 1968
imdbRating: 8.1
plot: This movie documents the Apollo missions perhaps the most definitively of any movie under two hours. Al Reinert watched all the footage shot during the missions--over 6,000,000 feet of it, ...
title: For All Mankind
year: 1989
imdbRating: 8.2
plot: An unscrupulous movie producer uses an actress, a director and a writer to achieve success.
title: Bad and the Beautiful, The
year: 1952
imdbRating: 7.9

如果你熟悉 IMDb 的评分，就会知道有很多电影的评分都在 8.3 分以上。在我们的数据库中，评分最高的电影实际上是一部系列片--《兄弟连》（Band of Brothers），评分高达 9.6 分，令人印象深刻。文本嵌入在结果排序方面再次表现不佳。

我们还可以评估一个需要进行某种汇总的问题：

pretty_print(neo4j_vector.similarity_search("How many movies are there?"))"How many movies are there?"))

结果：

plot: Ten television drama films, each one based on one of the Ten Commandments.
title: Decalogue, The (Dekalog)
year: 1989
imdbRating: 9.2
plot: A documentary which challenges former Indonesian death-squad leaders to reenact their mass-killings in whichever cinematic genres they wish, including classic Hollywood crime scenarios and lavish musical numbers.
title: Act of Killing, The
year: 2012
imdbRating: 8.2
plot: A meek Hobbit and eight companions set out on a journey to destroy the One Ring and the Dark Lord Sauron.
title: Lord of the Rings: The Fellowship of the Ring, The
year: 2001
imdbRating: 8.8
plot: While Frodo and Sam edge closer to Mordor with the help of the shifty Gollum, the divided fellowship makes a stand against Sauron's new ally, Saruman, and his hordes of Isengard.
title: Lord of the Rings: The Two Towers, The
year: 2002
imdbRating: 8.7

结果肯定没有帮助，因为我们随机得到了四部电影。从这随机的四部电影中几乎不可能得出结论，我们在这个示例中标记并嵌入了总共 1000 部电影。

结构化数据工具

目前，大多数人似乎都在考虑文本查询（text2query）方法，即由 LLM 生成数据库查询，根据提供的问题和模式与数据库交互。对于 Neo4j，这是 text2cypher，但对于 SQL 数据库，也有 text2sql。然而，在实践中发现，这种方法并不可靠，在生产中使用也不够稳健。

你可以使用思维链、少量示例或微调等技术，但在现阶段要达到高准确度几乎是不可能的。text2query 方法适用于简单数据库模式下的简单问题，但这并不是生产环境的实际情况。为了解决这个问题，我们将生成数据库查询的复杂性从 LLM 转移到代码问题上，根据函数输入确定性地生成数据库查询。这样做的好处是大大提高了鲁棒性，但代价是降低了灵活性。最好是缩小 RAG 应用的范围并准确回答这些问题，而不是试图回答所有问题，但却回答得不准确。

由于我们是根据函数输入生成数据库查询（本例中为 Cypher 语句），因此我们可以利用 LLM 的工具功能。在此过程中，LLM 根据用户输入填充相关参数，而函数则负责检索必要信息。在本演示中，我们将首先实现两个工具：一个用于计算电影数量，另一个用于列出电影清单，然后使用 LangGraph 创建一个 LLM 代理。

计算电影数量的工具

首先，我们将根据预定义的过滤器来实现电影计数工具。首先，我们必须定义这些过滤器，并向 LLM 描述何时以及如何使用它们：

class MovieCountInput(BaseModel):
    min_year: Optional[int] = Field(
        description="Minimum release year of the movies"
    )
    max_year: Optional[int] = Field(
        description="Maximum release year of the movies"
    )
    min_rating: Optional[float] = Field(description="Minimum imdb rating")
    grouping_key: Optional[str] = Field(
        description="The key to group by the aggregation", enum=["year"]
    )

LangChain 提供了多种定义函数输入的方法，但我更喜欢 Pydantic 方法。在本示例中，我们有三个过滤器可用于细化电影结果：min_year、max_year 和 min_rating。这些筛选器基于结构化数据，是可选的，用户可以选择包含其中任何一个、全部或一个都不包含。此外，我们还引入了分组键（grouping_key）输入，告诉函数是否按特定属性对计数进行分组。在本例中，唯一支持的分组方式是枚举小节中定义的按年份分组。

现在让我们定义实际函数：

@tool("movie-count", args_schema=MovieCountInput)
def movie_count(
    min_year: Optional[int],
    max_year: Optional[int],
    min_rating: Optional[float],
    grouping_key: Optional[str],
) -> List[Dict]:
    """Calculate the count of movies based on particular filters"""
    filters = [
        ("t.year >= $min_year", min_year),
        ("t.year <= $max_year", max_year),
        ("t.imdbRating >= $min_rating", min_rating),
    ]
    # Create the parameters dynamically from function inputs
    params = {
        extract_param_name(condition): value
        for condition, value in filters
        if value is not None
    }
    where_clause = " AND ".join(
        [condition for condition, value in filters if value is not None]
    )
    cypher_statement = "MATCH (t:Target) "
    if where_clause:
        cypher_statement += f"WHERE {where_clause} "
    return_clause = (
        f"t.`{grouping_key}`, count(t) AS movie_count"
        if grouping_key
        else "count(t) AS movie_count"
    )
    cypher_statement += f"RETURN {return_clause}"
    print(cypher_statement)  # Debugging output
    return graph.query(cypher_statement, params=params)

movie_count 函数生成一个 Cypher 查询，根据可选的筛选器和分组关键字对电影进行计数。首先，它定义了一个过滤器列表，并将相应的值作为参数提供。筛选器用于动态构建 WHERE 子句，该子句负责在 Cypher 语句中应用指定的筛选条件，其中只包括那些值不是 None 的条件。

然后构建 Cypher 查询的 RETURN 子句，或者根据提供的 grouping_key 进行分组，或者简单地计算电影总数。最后，函数执行查询并返回结果。

该函数可以根据需要扩展更多参数和更复杂的逻辑，但重要的是要确保它保持清晰，以便 LLM 可以正确、准确地调用它。

电影列表工具

同样，我们必须先定义函数的参数：

class MovieListInput(BaseModel):
    sort_by: str = Field(
        description="How to sort movies, can be one of either latest, rating",
        enum=["latest", "rating"],
    )
    k: Optional[int] = Field(description="Number of movies to return")
    description: Optional[str] = Field(description="Description of the movies")
    min_year: Optional[int] = Field(
        description="Minimum release year of the movies"
    )
    max_year: Optional[int] = Field(
        description="Maximum release year of the movies"
    )
    min_rating: Optional[float] = Field(description="Minimum imdb rating")

我们保留了电影计数函数中的三个过滤器，但增加了描述参数。通过这个参数，我们可以使用矢量相似性搜索，根据情节搜索并列出电影。虽然我们使用的是结构化工具和过滤器，但这并不意味着我们不能使用文本嵌入和向量搜索方法。由于我们在大多数情况下并不想返回所有电影，因此我们加入了一个可选的 k 输入，并设置了默认值。此外，在列表中，我们希望对影片进行排序，只返回最相关的影片。在这种情况下，我们可以按评分或发行年份排序。

让我们来实现这个函数：

@tool("movie-list", args_schema=MovieListInput)
def movie_list(
    sort_by: str = "rating",
    k : int = 4,
    description: Optional[str] = None,
    min_year: Optional[int] = None,
    max_year: Optional[int] = None,
    min_rating: Optional[float] = None,
) -> List[Dict]:
    """List movies based on particular filters"""
    # Handle vector-only search when no prefiltering is applied
    if description and not min_year and not max_year and not min_rating:
        return neo4j_vector.similarity_search(description, k=k)
    filters = [
        ("t.year >= $min_year", min_year),
        ("t.year <= $max_year", max_year),
        ("t.imdbRating >= $min_rating", min_rating),
    ]
    # Create parameters dynamically from function arguments
    params = {
        key.split("$")[1]: value for key, value in filters if value is not None
    }
    where_clause = " AND ".join(
        [condition for condition, value in filters if value is not None]
    )
    cypher_statement = "MATCH (t:Target) "
    if where_clause:
        cypher_statement += f"WHERE {where_clause} "
    # Add the return clause with sorting
    cypher_statement += " RETURN t.title AS title, t.year AS year, t.imdbRating AS rating ORDER BY "
    # Handle sorting logic based on description or other criteria
    if description:
        cypher_statement += (
            "vector.similarity.cosine(t.embedding, $embedding) DESC "
        )
        params["embedding"] = embedding.embed_query(description)
    elif sort_by == "rating":
        cypher_statement += "t.imdbRating DESC "
    else:  # sort by latest year
        cypher_statement += "t.year DESC "
    cypher_statement += " LIMIT toInteger($limit)"
    params["limit"] = k or 4
    print(cypher_statement)  # Debugging output
    data = graph.query(cypher_statement, params=params)
    return data

该函数根据多个可选筛选条件检索电影列表：描述、年份范围、最低评分和排序偏好。如果只给出描述而没有其他筛选条件，则会执行向量索引相似性搜索来查找相关电影。如果应用了其他筛选条件，该函数会构建一个 Cypher 查询，根据指定的条件（如上映年份和 IMDb 评级）来匹配电影，并将它们与可选的基于描述的相似性结合起来。然后，查询结果会按照相似度得分、IMDb 评分或年份进行排序，并限制在 k 部电影之内。

以 LangGraph 代理的形式实现所有功能

我们将使用 LangGraph 实现一个简单的 ReAct 代理。

代理由 LLM 和工具步骤组成。在与代理交互时，我们首先会调用 LLM 来决定是否使用工具。然后，我们将运行一个循环：

如果代理要求我们采取行动（即调用工具），我们就会运行工具，并将结果反馈给代理。
如果代理没有要求运行工具，我们就结束（回复用户）。

代码实现简单明了。首先，我们将工具绑定到 LLM，并定义助手步骤：

llm = ChatOpenAI(model='gpt-4-turbo')'gpt-4-turbo')
tools = [movie_count, movie_list]
llm_with_tools = llm.bind_tools(tools)
# System message
sys_msg = SystemMessage(content="You are a helpful assistant tasked with finding and explaining relevant information about movies.")
# Node
def assistant(state: MessagesState):
   return {"messages": [llm_with_tools.invoke([sys_msg] + state["messages"])]}

接下来我们定义 LangGraph 流程：

# Graph
builder = StateGraph(MessagesState)
# Define nodes: these do the work
builder.add_node("assistant", assistant)
builder.add_node("tools", ToolNode(tools))
# Define edges: these determine how the control flow moves
builder.add_edge(START, "assistant")
builder.add_conditional_edges(
    "assistant",
    # If the latest message (result) from assistant is a tool call -> tools_condition routes to tools
    # If the latest message (result) from assistant is a not a tool call -> tools_condition routes to END
    tools_condition,
)
builder.add_edge("tools", "assistant")
react_graph = builder.compile()

我们在 LangGraph 中定义了两个节点，并用条件边将它们连接起来。如果调用了工具，流程就会导向工具；否则，结果就会发回给用户。

现在让我们测试一下我们的代理：

messages = [
    HumanMessage(
        content="What are the some movies about a girl meeting her hero?""What are the some movies about a girl meeting her hero?"
    )
]
messages = react_graph.invoke({"messages": messages})
for m in messages["messages"]:
    m.pretty_print()

结果：

第一步，代理选择使用带有适当描述参数的电影列表工具。目前还不清楚它为什么选择 5 的 k 值，但它似乎倾向于这个数字。该工具会根据情节返回最相关的前五部电影，而 LLM 只会在最后为用户总结这些电影。

如果我们问 ChatGPT 为什么喜欢 k 值为 5，我们会得到如下回复。

接下来，让我们提出一个需要过滤元数据的稍微复杂一点的问题：

messages = [
    HumanMessage(
        content="What are the movies from the 90s about a girl meeting her hero?""What are the movies from the 90s about a girl meeting her hero?"
    )
]
messages = react_graph.invoke({"messages": messages})
for m in messages["messages"]:
    m.pretty_print()

结果：

这一次，使用了额外的参数，只过滤 1990 年代的电影。这个例子是使用预过滤方法过滤元数据的典型例子。生成的 Cypher 语句首先通过过滤电影的发行年份来缩小电影的范围。在下一部分中，Cypher 语句使用文本嵌入和向量相似性搜索来查找关于小女孩遇见她的英雄的电影。

让我们尝试根据各种条件来统计电影：

messages = [
    HumanMessage(
        content="How many movies are from the 90s have the rating higher than 9.1?""How many movies are from the 90s have the rating higher than 9.1?"
    )
]
messages = react_graph.invoke({"messages": messages})
for m in messages["messages"]:
    m.pretty_print()

结果：

有了专门的计数工具，复杂性就从 LLM 转移到了工具上，LLM 只负责填充相关的函数参数。这种任务分离使系统更加高效、稳健，并降低了 LLM 输入的复杂性。

既然代理可以连续或并行调用多个工具，那么我们就用更复杂的工具来测试它：

messages = [
    HumanMessage(
        content="How many were movies released per year made after the highest rated movie?""How many were movies released per year made after the highest rated movie?"
    )
]
messages = react_graph.invoke({"messages": messages})
for m in messages["messages"]:
    m.pretty_print()

结果：

如前所述，代理可以调用多种工具来收集回答问题所需的全部信息。在本例中，它首先列出了评分最高的电影，以确定评分最高的电影是何时上映的。有了这些数据后，它就会调用电影计数工具，使用问题中定义的分组键收集指定年份之后上映的电影数量。

总结

虽然文本嵌入非常适合搜索非结构化数据，但在进行过滤、排序和聚合等结构化操作时，文本嵌入就显得力不从心了。这些任务需要专为结构化数据设计的工具，它们能提供处理这些操作所需的精确性和灵活性。主要的启示是，扩展系统中的工具集可以让你处理更广泛的用户查询，使你的应用程序更加强大和通用。将结构化数据方法与非结构化文本搜索技术相结合，可以提供更准确、更相关的响应，最终提升 RAG 应用程序的用户体验。

文章来源：https://medium.com/neo4j/limitations-of-text-embeddings-in-rag-applications-b060020b543b

标签：

Neo4j 人工智能文本嵌入

0 评论

欢迎关注ATYUN官方公众号

商务合作及内容投稿请联系邮箱:bd@atyun.com

上一篇通过Python Notebook和OpenAI CLIP为视频构建矢量嵌入

下一篇【指南】如何安装和使用CLIP

评论登录

要发表评论，您必须先登录。

jonatasgrosman/wav2vec2-large-xlsr-53-english facebook/dino-vitb16 bert-base-uncased xlm-roberta-large xlm-roberta-base gpt2 microsoft/resnet-50 facebook/dino-vits8

AGENTIC AI如何塑造未来