Improving RAG with Query Expansion and Reranking Models

Published by alex on March 7, 2024

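Query Expansion

In information retrieval, you don't always get what you want. Query expansion is one technique proposed to improve a retrieval system's recall: it adds extra terms to the search query so that relevant documents with no lexical overlap with the original query can still be retrieved. The idea is especially useful for improving the performance of retrieval-augmented generation (RAG) systems.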

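Why Use Query Expansion?

Query expansion matters for several reasons:

Improved recall: it helps retrieve documents that are semantically related to the query but do not necessarily share keywords with it.
Resolving ambiguous queries: it helps with short or vague queries by supplying more context and clarity.
Better document matching: expanding the query terms raises the likelihood of matching the right documents in the database.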
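
The recall benefit can be seen in a toy setting. The following is only a minimal sketch, assuming a naive keyword-overlap retriever; the documents, the query, and the expansion terms are all invented for illustration and do not come from a real system.

```python
# Toy illustration: keyword-overlap retrieval with and without query expansion.
# The documents, the query, and the expansion terms are made up for this sketch.

def tokenize(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(text.lower().replace(",", " ").replace(".", " ").split())

def retrieve(query, documents):
    """Return the indices of documents sharing at least one token with the query."""
    query_tokens = tokenize(query)
    return [i for i, doc in enumerate(documents) if query_tokens & tokenize(doc)]

documents = [
    "Revenue grew because of strong cloud sales.",
    "Turnover rose after the product launch in Europe.",
    "The office moved to a new building.",
]

query = "revenue increase"
# Hypothetical expansion terms, e.g. synonyms an LLM might suggest.
expanded_query = query + " turnover rose grew"

baseline_hits = retrieve(query, documents)
expanded_hits = retrieve(expanded_query, documents)
print(baseline_hits)
print(expanded_hits)
```

Here the second document is relevant but shares no keyword with the original query, so only the expanded query reaches it. A real RAG retriever would typically use embeddings rather than raw token overlap, but the same effect applies.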

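LLM-Based Query Expansion

Recent work proposes using large language models (LLMs) for query expansion. Unlike traditional methods such as pseudo-relevance feedback (PRF), which depend on the content of retrieved documents, an LLM uses its generative ability to create meaningful expansions. This approach draws on the knowledge encoded in the LLM to generate alternative terms and phrases that are likely to be relevant to the original query.

Below is the code for query expansion (note that we use the OpenAI chat API).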

# We use OpenAI to generate a hypothetical answer for query expansion
import os
import textwrap
from openai import OpenAI

# Create the client with the API key taken from the environment
openai_client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

def augment_query_generated(query, model="gpt-3.5-turbo"):
    # The system prompt asks the model for a plausible answer, not a list of keywords
    messages = [
        {
            "role": "system",
            "content": "You are a helpful expert financial research assistant. Provide an example answer to the given question, that might be found in a document like an annual report. "
        },
        {"role": "user", "content": query}
    ]
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content

original_query = "What were the most important factors that contributed to increases in revenue?"
hypothetical_answer = augment_query_generated(original_query)
# Combine the original query with the hypothetical answer
joint_query = f"{original_query} {hypothetical_answer}"
# Wrap the long joint query so it prints readably
print(textwrap.fill(joint_query, width=72))

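For a given query, the prompt asks the LLM to generate a hypothetical answer. We then combine the generated answer with the original query and use the result as a joint query for retrieval. This gives the retrieval step more context to match against before results are fetched.

Output: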
What were the most important factors that contributed to increases in
revenue? In the fiscal year 2020, several key factors contributed to
the significant increase in our company's revenue. Firstly, we
implemented a successful marketing campaign that effectively targeted
new customer segments and enhanced our brand visibility. This resulted
in a substantial growth in our customer base and overall
sales.
Secondly, we expanded our product line by introducing
innovative products that catered to evolving consumer preferences. This
diversification strategy allowed us to tap into new markets and
capitalize on emerging trends, thereby driving revenue
growth.
Additionally, our commitment to customer satisfaction and
delivering exceptional service played a crucial role in increasing
revenue. By focusing on enhancing customer experience and implementing
customer retention programs, we not only fostered loyalty but also
attracted new customers through positive word-of-mouth
recommendations.
Furthermore, we adopted a proactive approach to
pricing and cost management. Through effective cost-cutting measures
and strategic pricing adjustments, we were able to optimize
profitability without compromising on product quality or customer
satisfaction. This emphasis on achieving operational efficiency
positively impacted our revenue growth.
Lastly, our investments in
technology and digital transformation significantly contributed to
revenue increases. By leveraging data analytics and automation, we
streamlined our processes, improved our decision-making capabilities,
and personalized customer experiences. These technological advancements
resulted in higher customer engagement and increased revenue
generation.
In conclusion, the most important factors that contributed
to the increases in our revenue included successful marketing
campaigns, product diversification, customer satisfaction initiatives,
strategic pricing, and investments in technology and digital
transformation.

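As you can see, using query expansion in a RAG system has several benefits:

Better document retrieval: an expanded query allows more accurate and more complete document retrieval, which is a key step for a RAG model.
Improved understanding: an expanded query gives the RAG model broader context, improving its comprehension and the quality of its responses.
Versatility: the approach works across domains and query types, making RAG models more versatile.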

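Drawbacks and the Role of Reranking

Although query expansion brings substantial benefits, it is not without drawbacks:

Over-expansion: adding too many terms can sometimes cause irrelevant documents to be retrieved.
Quality control: the relevance of the expanded terms cannot always be guaranteed.

Reranking plays a crucial role in mitigating these problems. It refines the initial retrieval output by recalibrating document ranks according to their relevance to the expanded query. This ensures that the most relevant documents are prioritized and effectively filters out the noise introduced by query expansion.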

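Cross-Encoder Reranking

Among reranking methods, cross-encoder models stand out for their ability to significantly improve search accuracy. Unlike traditional ranking signals such as cosine similarity between embeddings, these models use deep learning to assess directly how well each document matches the query. A cross-encoder processes the query and the document together as a single input and outputs a relevance score, which allows a finer-grained document selection process.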

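Practical Application

In practice, cross-encoder reranking is applied after the search query has been expanded to pull in a broader set of documents. This not only refines the document selection from the initial retrieval, it also makes the RAG model more useful in the following ways:

Higher precision: the cross-encoder ranks documents by their actual relevance to the query, which improves retrieval precision.
Broader applicability: the approach adapts smoothly to different domains and query types, making the RAG model more adaptable.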

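Example Use Case

Suppose your application needs to retrieve documents and rank them by relevance to a user query. After the initial query expansion, you apply a cross-encoder model to rerank the results.

In the example below we use the Cross-Encoder from sentence-transformers. You pass in the documents returned by your RAG retriever, and the Cross-Encoder orders them by relevance to the query.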

import numpy as np
#cross encoder reranker
from sentence_transformers import CrossEncoder
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# retrieved_documents is assumed to be the list of Document objects returned by your retriever
# Extract text content from Document objects and convert to strings
document_texts = [doc.page_content for doc in retrieved_documents]
query_text = "What were the most important factors that contributed to increases in revenue?"
# Create pairs as strings
pairs = [[query_text, doc_text] for doc_text in document_texts]
# Predict scores for pairs
scores = cross_encoder.predict(pairs)
# Print scores
print("Scores:")
for score in scores:
    print(score)

print("New Ordering:")
for o in np.argsort(scores)[::-1]:
    print(o+1)

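You can use these results as they are, or take the reranker's ordering and process it further, for example by passing the top-ranked documents to an LLM to obtain the final answer.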
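
As a rough sketch of that last step, the helper below keeps the highest-scoring documents and joins them into a context string for an LLM prompt. The function name `top_k_documents` is hypothetical, and the texts and scores are made-up stand-ins for `document_texts` and the output of `cross_encoder.predict(pairs)`.

```python
def top_k_documents(documents, scores, k=3):
    """Return the k documents with the highest scores, best first."""
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:k]]

# Made-up texts and scores standing in for document_texts and the reranker scores.
example_texts = ["doc A", "doc B", "doc C", "doc D"]
example_scores = [0.2, 4.1, -3.0, 1.7]

top_docs = top_k_documents(example_texts, example_scores, k=2)
# Join the selected documents into one context block for the LLM prompt.
context = "\n\n".join(top_docs)
print(context)
```

How many documents to keep is a design choice: a smaller k keeps the prompt short and focused, while a larger k lowers the risk of dropping a relevant passage.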

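Understanding ColBERT: A Reranking Model

ColBERT is a document reranking model that builds a late-interaction architecture on top of BERT, with the aim of improving document retrieval and ranking performance. It is notable for combining computational efficiency with high accuracy.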

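Core Idea and Architecture

ColBERT uses BERT to encode the query text and the document text separately, which allows document encodings to be precomputed offline and greatly reduces the computation needed per query. The model encodes every query token and every document token into a low-dimensional vector, which enables fast and accurate retrieval.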

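Late Interaction Mechanism

The key to ColBERT's efficiency is its late interaction mechanism. Instead of compressing all token vectors into a single vector, it compares each query vector with each document vector. This gives a finer-grained and more accurate picture of how relevant a document is to the query.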
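
The late-interaction score can be sketched in a few lines. This is a simplified illustration in plain Python, not ColBERT's implementation: for each query token it takes the maximum cosine similarity over the document tokens, then sums those maxima. The function names and the 2-dimensional token embeddings are invented for the example; real ColBERT vectors come from BERT.

```python
import math

def normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def maxsim_score(query_vectors, doc_vectors):
    """Late-interaction score: for each query token, take its best-matching
    document token (maximum cosine similarity), then sum over query tokens."""
    q = [normalize(v) for v in query_vectors]
    d = [normalize(v) for v in doc_vectors]
    total = 0.0
    for q_vec in q:
        total += max(sum(a * b for a, b in zip(q_vec, d_vec)) for d_vec in d)
    return total

# Made-up 2-dimensional token embeddings.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # covers both query tokens
doc_b_tokens = [[1.0, 0.0], [3.0, 0.0]]              # covers only the first one

score_a = maxsim_score(query_tokens, doc_a_tokens)
score_b = maxsim_score(query_tokens, doc_b_tokens)
print(score_a, score_b)
```

Document A scores higher because each query token finds a close match among its tokens, whereas document B matches only one of the two query tokens.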

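Indexing and Retrieval in ColBERT

ColBERT's indexing process has three stages:

Centroid selection: k-means clustering is used to choose the centroids for residual encoding.
Passage encoding: each document embedding is encoded as its nearest selected centroid plus a quantized residual.
Index inversion: inverted lists of embeddings grouped by centroid are created for fast lookup.

At retrieval time, ColBERT efficiently computes the cosine similarity between each query vector and the candidate document vectors, keeps the best match for each query vector, and sums these maxima to rank documents quickly and accurately.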
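
The passage-encoding and index-inversion stages can be illustrated with a small sketch. It rests on loud simplifications: the centroids are given by hand rather than learned with k-means, and the residual is rounded to multiples of a fixed `step` as a stand-in for the bit-level quantization a real index would use. All function names and values are hypothetical.

```python
from collections import defaultdict

def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to vector (squared Euclidean distance)."""
    distances = [sum((v - c) ** 2 for v, c in zip(vector, centroid))
                 for centroid in centroids]
    return distances.index(min(distances))

def encode(vector, centroids, step=0.25):
    """Encode a vector as (centroid id, integer residual codes).
    Rounding to multiples of `step` is a simplification of real residual quantization."""
    cid = nearest_centroid(vector, centroids)
    residual = [v - c for v, c in zip(vector, centroids[cid])]
    codes = [round(r / step) for r in residual]
    return cid, codes

def decode(cid, codes, centroids, step=0.25):
    """Approximately reconstruct a vector from its centroid id and residual codes."""
    return [c + code * step for c, code in zip(centroids[cid], codes)]

def build_inverted_lists(embeddings, centroids):
    """Encode every embedding and group embedding ids by centroid id."""
    inverted = defaultdict(list)
    encoded = []
    for emb_id, vector in enumerate(embeddings):
        cid, codes = encode(vector, centroids)
        encoded.append((cid, codes))
        inverted[cid].append(emb_id)
    return encoded, dict(inverted)

# Hand-picked centroids (assumed to come from k-means) and made-up token embeddings.
centroids = [[0.0, 0.0], [1.0, 1.0]]
embeddings = [[0.1, -0.2], [0.9, 1.3], [1.2, 0.8], [-0.3, 0.1]]

encoded, inverted_lists = build_inverted_lists(embeddings, centroids)
print(encoded)
print(inverted_lists)
```

At query time, an index of this kind only needs to look at the inverted lists of the centroids closest to each query vector, which is what makes candidate generation fast.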

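Practical Application

We will use the ColBERT reranker together with LanceDB, which provides an interface for querying documents with a choice of reranking methods for hybrid search.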

import lancedb
from lancedb.embeddings import EmbeddingFunctionRegistry
from lancedb.pydantic import LanceModel, Vector
from lancedb.rerankers import ColbertReranker

db = lancedb.connect("/tmp/db")
# Use the OpenAI embedding function registered in LanceDB
registry = EmbeddingFunctionRegistry.get_instance()
func = registry.get("openai").create()

# Table schema: the text is the source field, the vector is computed automatically
class Words(LanceModel):
    text: str = func.SourceField()
    vector: Vector(func.ndims()) = func.VectorField()

table = db.create_table("colbertv2demo", schema=Words, mode="overwrite")
# Data from the retriever
formatted_data = [{"text": doc.page_content} for doc in retrieved_documents]
# Ingest docs with auto-vectorization
table.add(formatted_data)
# Create the FTS index on the 'text' field (needed for hybrid search)
table.create_fts_index(['text'], replace=True)
# ColBERT reranker
reranker_colbert = ColbertReranker()
results_colbert = table.search("technologies and business models", query_type="hybrid").rerank(reranker=reranker_colbert).to_pandas()

The results_colbert DataFrame holds the results as reranked by the ColBERT v2 model.

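FlashRank

FlashRank is an ultra-light, ultra-fast Python library that adds reranking to your existing search and retrieval pipelines. It is based on state-of-the-art (SoTA) cross-encoders. Before running the code below, make sure you have installed it with pip install flashrank.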

from flashrank import Ranker, RerankRequest
query = 'What were the most important factors that contributed to increases in revenue?'
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="/opt")
# formatted_data is the list of {"text": ...} dicts built from the retrieved documents
rerankrequest = RerankRequest(query=query, passages=formatted_data)
results = ranker.rerank(rerankrequest)
print(results)


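Conclusion

Combining LLM-based query expansion with advanced reranking models such as cross-encoders, ColBERT v2, and FlashRank opens up new possibilities for information retrieval. These methods improve both the precision and the recall of document retrieval systems and help RAG models deliver highly relevant, context-rich results. As exploration and innovation in this area continue, these tools should become easier to use and more common across use cases and domains.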

Source: https://medium.com/@aksdesai1998/improving-rag-with-query-expansion-reranking-models-31d252856580
