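Introduction
Retrieval-augmented generation (RAG) introduces an innovative approach that combines the broad retrieval capabilities of search systems with large language models (LLMs). When implementing a RAG system, one critical parameter that governs its efficiency and performance is chunk_size. How do you determine the optimal chunk_size for seamless retrieval? This is where the LlamaIndex Response Evaluation module comes in. In this blog post, we walk you through using LlamaIndex's Response Evaluation module to determine the best chunk_size.
Why chunk_size matters
Choosing the right chunk_size is a critical decision that can influence the efficiency and accuracy of a RAG system in several ways:
1. Relevance and granularity: A small chunk_size, such as 128, yields more granular chunks. This granularity carries a risk, however: vital information might not be among the top retrieved chunks, especially if similarity_top_k is set as restrictively as 2. Conversely, a chunk_size of 512 is likely to contain all the necessary information within the top chunks, so answers to queries are readily available. To assess this, we use the Faithfulness and Relevancy metrics. These measure the absence of "hallucinations" and the "relevancy" of responses, based on the query and the retrieved contexts.
2. Response generation time: As chunk_size increases, so does the volume of information passed to the LLM to generate an answer. While this can ensure a more comprehensive context, it can also slow the system down. It is crucial that the added depth does not compromise responsiveness.
In short, determining the optimal chunk_size is about striking a balance: capturing all essential information without sacrificing speed. It is important to test thoroughly with several sizes to find a configuration that suits the specific use case and dataset.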
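To make this trade-off concrete, here is a minimal sketch that does not depend on LlamaIndex. It approximates tokens by whitespace-separated words, which is an assumption: real tokenizers, and LlamaIndex's own splitter with its overlap and sentence handling, behave differently. The helper names split_into_chunks and max_context_tokens are hypothetical, and the document is synthetic.

```python
def split_into_chunks(text: str, chunk_size: int) -> list[list[str]]:
    """Split text into consecutive chunks of at most chunk_size words.

    Words stand in for tokens here; this is a simplification.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


def max_context_tokens(chunk_size: int, similarity_top_k: int = 2) -> int:
    """Upper bound on retrieved tokens handed to the LLM per query."""
    return chunk_size * similarity_top_k


if __name__ == "__main__":
    # Synthetic 10,000-word document, used only to illustrate the arithmetic.
    document = " ".join(f"w{i}" for i in range(10_000))
    for size in (128, 512, 2048):
        chunks = split_into_chunks(document, size)
        print(
            f"chunk_size={size}: {len(chunks)} chunks, "
            f"up to {max_context_tokens(size)} tokens of context"
        )
```

Smaller chunks mean more candidates to retrieve from but less context per query; larger chunks mean fewer candidates and more text for the LLM to read, which is where the extra response time comes from.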
Setup
Before running the experiment, we need to import all the required modules.
import nest_asyncio
nest_asyncio.apply()
from llama_index import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    ServiceContext,
)
from llama_index.evaluation import (
    DatasetGenerator,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
)
from llama_index.llms import OpenAI
import openai
import time
openai.api_key = 'OPENAI-API-KEY'
Download Data
For this experiment, we will use the Uber 10-K filing for 2021.
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
Load Data
Let's load our document.
documents = SimpleDirectoryReader("./data/10k/").load_data()
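Question Generation
To select the right chunk_size, we'll compute metrics such as average response time, faithfulness, and relevancy for different chunk_sizes. The DatasetGenerator will help us generate questions from the documents.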
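# Evaluate on the first 20 pages; eval_documents is also used later to build the index.
eval_documents = documents[:20]
data_generator = DatasetGenerator.from_documents(eval_documents)
eval_questions = data_generator.generate_questions_from_nodes(num=20)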
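Setting Up Evaluators
We set up the GPT-4 model to serve as the basis for evaluating the responses generated during the experiment. Two evaluators, FaithfulnessEvaluator and RelevancyEvaluator, are initialized with the service_context.
1. Faithfulness Evaluator: measures whether the response was hallucinated, by checking whether the response from a query engine matches any source nodes.
2. Relevancy Evaluator: measures whether the response actually answers the query, by checking whether the response plus source nodes match the query.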
# We will use GPT-4 for evaluating the responses
gpt4 = OpenAI(temperature=0, model="gpt-4")
# Define service context for GPT-4 for evaluation
service_context_gpt4 = ServiceContext.from_defaults(llm=gpt4)
# Define Faithfulness and Relevancy Evaluators which are based on GPT-4
faithfulness_gpt4 = FaithfulnessEvaluator(service_context=service_context_gpt4)
relevancy_gpt4 = RelevancyEvaluator(service_context=service_context_gpt4)
Response Evaluation For A Chunk Size
We evaluate each chunk_size on three metrics.
1. Average response time.
2. Average faithfulness.
3. Average relevancy.
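Faithfulness and relevancy are recorded per question as pass/fail verdicts (the passing flag used in the code in this post), so each "average" is simply the fraction of questions that passed. A minimal sketch of that arithmetic follows; pass_rate is a hypothetical helper and the verdicts are made up for illustration.

```python
def pass_rate(verdicts: list[bool]) -> float:
    """Return the fraction of True verdicts, or 0.0 for an empty list."""
    if not verdicts:
        return 0.0
    # True counts as 1 and False as 0 when summed.
    return sum(verdicts) / len(verdicts)


# Made-up verdicts for illustration only, not measured results.
example_verdicts = [True, True, False, True]
print(pass_rate(example_verdicts))  # 0.75
```

Because each verdict is binary, a small question set gives a coarse score: with 20 questions, one changed verdict moves the average by 0.05.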
Here is a function, evaluate_response_time_and_accuracy, that does the following:
1. Vector index creation.
2. Building the query engine.
3. Metrics calculation.
# Define function to calculate average response time, average faithfulness and average relevancy metrics for given chunk size
# We use GPT-3.5-Turbo to generate response and GPT-4 to evaluate it.
def evaluate_response_time_and_accuracy(chunk_size, eval_questions):
    """
    Evaluate the average response time, faithfulness, and relevancy of responses generated by GPT-3.5-turbo for a given chunk size.

    Parameters:
    chunk_size (int): The size of data chunks being processed.
    eval_questions (list[str]): The questions used to query the index.

    Returns:
    tuple: A tuple containing the average response time, faithfulness, and relevancy metrics.

    Note: relies on the module-level eval_documents, faithfulness_gpt4, and relevancy_gpt4.
    """
    total_response_time = 0
    total_faithfulness = 0
    total_relevancy = 0

    # create vector index
    llm = OpenAI(model="gpt-3.5-turbo")
    service_context = ServiceContext.from_defaults(llm=llm, chunk_size=chunk_size)
    vector_index = VectorStoreIndex.from_documents(
        eval_documents, service_context=service_context
    )

    # build query engine
    query_engine = vector_index.as_query_engine()
    num_questions = len(eval_questions)

    # Iterate over each question in eval_questions to compute metrics.
    # While BatchEvalRunner can be used for faster evaluations (see: https://docs.llamaindex.ai/en/latest/examples/evaluation/batch_eval.html),
    # we're using a loop here to specifically measure response time for different chunk sizes.
    for question in eval_questions:
        start_time = time.time()
        response_vector = query_engine.query(question)
        elapsed_time = time.time() - start_time

        # .passing is a boolean, so the sums below count passing responses
        faithfulness_result = faithfulness_gpt4.evaluate_response(
            response=response_vector
        ).passing
        relevancy_result = relevancy_gpt4.evaluate_response(
            query=question, response=response_vector
        ).passing

        total_response_time += elapsed_time
        total_faithfulness += faithfulness_result
        total_relevancy += relevancy_result

    average_response_time = total_response_time / num_questions
    average_faithfulness = total_faithfulness / num_questions
    average_relevancy = total_relevancy / num_questions

    return average_response_time, average_faithfulness, average_relevancy
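The function above times each query with time.time(). As a possible variation, and not what the original code does, time.perf_counter() is a monotonic, high-resolution clock intended for measuring short durations. The sketch below wraps any callable; timed_call and fake_query are hypothetical names, with fake_query standing in for query_engine.query so the example runs without an API key.

```python
import time
from typing import Any, Callable


def timed_call(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float]:
    """Call fn with the given arguments and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_query(question: str) -> str:
    """Stand-in for query_engine.query; sleeps briefly to simulate latency."""
    time.sleep(0.01)
    return f"answer to: {question}"


response, seconds = timed_call(fake_query, "What was the revenue?")
print(response, f"{seconds:.3f}s")
```

Keeping the timer around the query call only, as the original loop does, means the evaluator calls to GPT-4 are excluded from the measured response time.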
Testing Across Different Chunk Sizes
We'll evaluate a range of chunk sizes to identify which offers the most promising metrics.
chunk_sizes = [128, 256, 512, 1024, 2048]
for chunk_size in chunk_sizes:
    avg_response_time, avg_faithfulness, avg_relevancy = evaluate_response_time_and_accuracy(chunk_size, eval_questions)
    print(f"Chunk size {chunk_size} - Average Response time: {avg_response_time:.2f}s, Average Faithfulness: {avg_faithfulness:.2f}, Average Relevancy: {avg_relevancy:.2f}")
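Once the metrics are printed, it can help to collect them and apply an explicit selection rule. The sketch below shows one possible rule, which the original experiment does not prescribe: rank by the sum of faithfulness and relevancy, and break ties in favor of the lower response time. ChunkResult and pick_chunk_size are hypothetical names, and the values are placeholders, not measurements.

```python
from typing import NamedTuple


class ChunkResult(NamedTuple):
    chunk_size: int
    avg_response_time: float  # seconds
    avg_faithfulness: float   # fraction of questions that passed
    avg_relevancy: float      # fraction of questions that passed


def pick_chunk_size(results: list[ChunkResult]) -> ChunkResult:
    """Pick the result with the best quality, preferring faster responses on ties."""
    if not results:
        raise ValueError("results must not be empty")
    return max(
        results,
        key=lambda r: (r.avg_faithfulness + r.avg_relevancy, -r.avg_response_time),
    )


# Placeholder values for illustration only; substitute your own measurements.
results = [
    ChunkResult(100, 1.0, 0.80, 0.70),
    ChunkResult(200, 1.2, 0.90, 0.80),
    ChunkResult(400, 1.5, 0.90, 0.80),
]
best = pick_chunk_size(results)
print(best.chunk_size)  # 200
```

Other rules are equally defensible, for example setting a maximum acceptable response time first and then maximizing quality within that budget.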
Bringing It All Together
Let's put the whole process together:
import nest_asyncio
nest_asyncio.apply()
from llama_index import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    ServiceContext,
)
from llama_index.evaluation import (
    DatasetGenerator,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
)
from llama_index.llms import OpenAI
import openai
import time
openai.api_key = 'OPENAI-API-KEY'
# Download Data
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
# Load Data
reader = SimpleDirectoryReader("./data/10k/")
documents = reader.load_data()
# To evaluate for each chunk size, we will first generate a set of 40 questions from the first 20 pages.
eval_documents = documents[:20]
data_generator = DatasetGenerator.from_documents(eval_documents)
eval_questions = data_generator.generate_questions_from_nodes(num=20)
# We will use GPT-4 for evaluating the responses
gpt4 = OpenAI(temperature=0, model="gpt-4")
# Define service context for GPT-4 for evaluation
service_context_gpt4 = ServiceContext.from_defaults(llm=gpt4)
# Define Faithfulness and Relevancy Evaluators which are based on GPT-4
faithfulness_gpt4 = FaithfulnessEvaluator(service_context=service_context_gpt4)
relevancy_gpt4 = RelevancyEvaluator(service_context=service_context_gpt4)
# Define function to calculate average response time, average faithfulness and average relevancy metrics for given chunk size
def evaluate_response_time_and_accuracy(chunk_size):
    total_response_time = 0
    total_faithfulness = 0
    total_relevancy = 0

    # create vector index
    llm = OpenAI(model="gpt-3.5-turbo")
    service_context = ServiceContext.from_defaults(llm=llm, chunk_size=chunk_size)
    vector_index = VectorStoreIndex.from_documents(
        eval_documents, service_context=service_context
    )

    query_engine = vector_index.as_query_engine()
    num_questions = len(eval_questions)

    for question in eval_questions:
        start_time = time.time()
        response_vector = query_engine.query(question)
        elapsed_time = time.time() - start_time

        faithfulness_result = faithfulness_gpt4.evaluate_response(
            response=response_vector
        ).passing
        relevancy_result = relevancy_gpt4.evaluate_response(
            query=question, response=response_vector
        ).passing

        total_response_time += elapsed_time
        total_faithfulness += faithfulness_result
        total_relevancy += relevancy_result

    average_response_time = total_response_time / num_questions
    average_faithfulness = total_faithfulness / num_questions
    average_relevancy = total_relevancy / num_questions

    return average_response_time, average_faithfulness, average_relevancy
# Iterate over different chunk sizes to evaluate the metrics to help fix the chunk size.
for chunk_size in [128, 256, 512, 1024, 2048]:
    avg_time, avg_faithfulness, avg_relevancy = evaluate_response_time_and_accuracy(chunk_size)
    print(f"Chunk size {chunk_size} - Average Response time: {avg_time:.2f}s, Average Faithfulness: {avg_faithfulness:.2f}, Average Relevancy: {avg_relevancy:.2f}")
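Result
The table above shows that average response time increases slightly as the chunk size grows. Interestingly, average faithfulness appears to peak at a chunk size of 1024, while average relevancy improves consistently with larger chunk sizes and also peaks at 1024. This suggests that a chunk size of 1024 may strike the best balance between response time and response quality, as measured by faithfulness and relevancy.
Conclusion
Identifying the best chunk size for a RAG system relies on both intuition and empirical evidence. With LlamaIndex's Response Evaluation module, you can experiment with different sizes and base your decision on your own data. When building a RAG system, always remember that chunk_size is a pivotal parameter. Take the time to evaluate and adjust it carefully to get the best results.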