A Deep Dive into LangSmith: Evaluating Three Core Components Developed by LangChain

January 24, 2024 · posted by alex

Introduction


I really like the way notebooks integrate seamlessly with the LangSmith GUI: tasks are executed in code, and the results are inspected through the web interface.


Applications are organized under projects, and prompts are accessible through the prompt hub. Datasets and tests store data that is linked to agent runs.


LangChain has gone a long way toward democratizing the development of LLM-based generative applications.


Prototyping with LangChain is one thing; taking an application to production is quite another.


LangSmith is LangChain's companion technology, assisting with observability, inspectability, testing, and continuous improvement.


LangSmith is especially useful when running autonomous agents, since it surfaces the different steps, or chains, in an agent sequence. It also helps when multiple parallel requests are sent to the LLM.


In this article I only consider three of LangSmith's five tools: Projects, Datasets & Testing, and the Hub.




Running the Agent While Logging Traces


Whenever a new project is created in LangSmith, the Setup tab displays code snippets that can be included in your code to reference the project and log traces to it.




Below is the Python code to install the required LangChain components. Note how the environment variables are set. You can run this application entirely in a Colab notebook.


%pip install --upgrade --quiet langchain langsmith langchainhub
%pip install --upgrade --quiet langchain-openai tiktoken pandas duckduckgo-search

import os
from uuid import uuid4

# A short random suffix keeps project and dataset names unique across reruns
unique_id = uuid4().hex[0:8]

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"LangSmith_WalkThrough_{unique_id}"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "<Your LangSmith API Key>"  # Update to your API key

# Used by the agent in this tutorial
os.environ["OPENAI_API_KEY"] = "<Your OpenAI API Key>"


Defining the Agent


Turning to the agent, note the single tool defined for it:


from langchain.tools import DuckDuckGoSearchResults

tools = [
    DuckDuckGoSearchResults(
        name="duck_duck_go"
    ),  # General internet search using DuckDuckGo
]
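

As a quick sanity check, the tool can be invoked directly to see the raw string the agent will receive as context (a minimal sketch reusing the tools list above; the query is only an example):


# Invoke the DuckDuckGo tool outside the agent to inspect its raw output
search_tool = tools[0]
print(search_tool.run("What is LangChain?"))  # snippets and links as one string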


The agent's prompt is pulled from the hub with the code below:


from langchain import hub

prompt = hub.pull("wfh/langsmith-agent-prompt:5d466cbc")


The prompt retrieved from the prompt hub looks like this:


input_variables=['agent_scratchpad', 'input']
input_types={'agent_scratchpad': typing.List[typing.Union[
    langchain_core.messages.ai.AIMessage,
    langchain_core.messages.human.HumanMessage,
    langchain_core.messages.chat.ChatMessage,
    langchain_core.messages.system.SystemMessage,
    langchain_core.messages.function.FunctionMessage,
    langchain_core.messages.tool.ToolMessage]]}
messages=[
    SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[],
        template='You are an expert senior software engineer. You are responsible for answering questions about LangChain. Use functions to consult the documentation before answering.')),
    HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'],
        template='{input}')),
    MessagesPlaceholder(variable_name='agent_scratchpad')]
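

For clarity, the pulled prompt is roughly equivalent to building this ChatPromptTemplate by hand (a sketch for illustration, not the hub artifact itself):


from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# A hand-built approximation of the pulled prompt: system instructions,
# the user input, and a placeholder for the agent's intermediate steps
equivalent_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert senior software engineer. "
            "You are responsible for answering questions about LangChain. "
            "Use functions to consult the documentation before answering.",
        ),
        ("human", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ]
)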


The code to run the agent:


from langsmith import Client

from langchain import hub
from langchain.agents import AgentExecutor
from langchain.agents.format_scratchpad import format_to_openai_function_messages
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser
from langchain.tools import DuckDuckGoSearchResults
from langchain_community.tools.convert_to_openai import format_tool_to_openai_function
from langchain_openai import ChatOpenAI

client = Client()

# Fetches the pinned version of this prompt from the hub
prompt = hub.pull("wfh/langsmith-agent-prompt:5d466cbc")
llm = ChatOpenAI(
    model="gpt-3.5-turbo-16k",
    temperature=0,
)
tools = [
    DuckDuckGoSearchResults(
        name="duck_duck_go"
    ),  # General internet search using DuckDuckGo
]
llm_with_tools = llm.bind(functions=[format_tool_to_openai_function(t) for t in tools])
runnable_agent = (
    {
        "input": lambda x: x["input"],
        "agent_scratchpad": lambda x: format_to_openai_function_messages(
            x["intermediate_steps"]
        ),
    }
    | prompt
    | llm_with_tools
    | OpenAIFunctionsAgentOutputParser()
)
agent_executor = AgentExecutor(
    agent=runnable_agent, tools=tools, handle_parsing_errors=True
)
inputs = [
    "What is LangChain?",
    "What's LangSmith?",
    "When was Llama-v2 released?",
    "What is the langsmith cookbook?",
    "When did langchain first announce the hub?",
]
results = agent_executor.batch([{"input": x} for x in inputs], return_exceptions=True)


And the agent's output:


[{'input': 'What is LangChain?',
  'output': 'I\'m sorry, but I couldn\'t find any information about "LangChain". Could you please provide more context or clarify your question?'},
 {'input': "What's LangSmith?",
  'output': 'I\'m sorry, but I couldn\'t find any information about "LangSmith". It could be a company, a product, or a person. Can you provide more context or details about what you are referring to?'},
 {'input': 'When was Llama-v2 released?',
  'output': 'Llama-v2 was released on July 18, 2023.'},
 {'input': 'What is the langsmith cookbook?',
  'output': 'The Langsmith Cookbook is a collection of recipes and cooking techniques created by Langsmith, a fictional character. It is a comprehensive guide that covers a wide range of cuisines and dishes. The cookbook includes step-by-step instructions, ingredient lists, and tips for successful cooking. Whether you are a beginner or an experienced cook, the Langsmith Cookbook can help you enhance your culinary skills and create delicious meals.'},
 {'input': 'When did langchain first announce the hub?',
  'output': 'LangChain first announced the LangChain Hub on September 5, 2023.'}]


Clearly the agent's performance is not ideal, so LangSmith can be used to evaluate and improve it.


The image below illustrates inspectability: the agent's execution chain can be inspected. It shows the response snippet from the DuckDuckGo integration, which serves as in-context learning for the prompt.


[Screenshot: the agent's execution chain in LangSmith, showing the DuckDuckGo response snippet]
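

The same traces can also be fetched programmatically, reusing the client created above (a sketch; only the project_name filter is required):


# Iterate over the runs logged to the tracing project and print a summary
for run in client.list_runs(project_name=os.environ["LANGCHAIN_PROJECT"]):
    print(run.name, run.run_type, run.error)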


Creating a LangSmith Dataset


Below are the five example inputs and outputs used for the test runs. These entries will be used to measure the performance of a new agent.


A dataset is a collection of examples, which are nothing more than input-output pairs you can use as test cases for your application.


[Screenshot: the dataset examples in LangSmith]


The example set is loaded as follows:


outputs = [
    "LangChain is an open-source framework for building applications using large language models. It is also the name of the company building LangSmith.",
    "LangSmith is a unified platform for debugging, testing, and monitoring language model applications and agents powered by LangChain",
    "July 18, 2023",
    "The langsmith cookbook is a github repository containing detailed examples of how to use LangSmith to debug, evaluate, and monitor large language model-powered applications.",
    "September 5, 2023",
]


And:


# Dataset names must be unique, so reuse the uuid suffix from earlier
dataset_name = f"LangSmith Walkthrough - {unique_id}"
dataset = client.create_dataset(
    dataset_name,
    description="An example dataset of questions over the LangSmith documentation.",
)
client.create_examples(
    inputs=[{"input": query} for query in inputs],
    outputs=[{"output": answer} for answer in outputs],
    dataset_id=dataset.id,
)
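

The stored examples can be read back through the client to confirm the upload (a small sketch using the same client and dataset_name):


# List the examples just created, verifying inputs and reference outputs
for example in client.list_examples(dataset_name=dataset_name):
    print(example.inputs["input"], "->", example.outputs["output"])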


Defining the Agent to Benchmark


Below, an agent that uses OpenAI's function-calling endpoint is defined.


from langchain.agents import AgentExecutor
from langchain.agents.format_scratchpad import format_to_openai_function_messages
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser

# Since chains can be stateful (e.g. they can have memory), we provide
# a way to initialize a new chain for each row in the dataset. This is done
# by passing in a factory function that returns a new chain for each row.
def create_agent(prompt, llm_with_tools):
    runnable_agent = (
        {
            "input": lambda x: x["input"],
            "agent_scratchpad": lambda x: format_to_openai_function_messages(
                x["intermediate_steps"]
            ),
        }
        | prompt
        | llm_with_tools
        | OpenAIFunctionsAgentOutputParser()
    )
    # Note: `tools` comes from the enclosing scope defined earlier
    return AgentExecutor(agent=runnable_agent, tools=tools, handle_parsing_errors=True)
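

As a usage sketch, each call to the factory yields a fresh AgentExecutor, so no state leaks between dataset rows (the question is only an example):


# Every invocation builds a brand-new executor from the same parts
executor = create_agent(prompt=prompt, llm_with_tools=llm_with_tools)
print(executor.invoke({"input": "What is LangSmith?"})["output"])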


Configuring Evaluation


Manually comparing chain results in the UI works, but it can be time-consuming.


It is faster to evaluate agent performance automatically, using metrics and AI-assisted feedback.


Below is the code to create a custom run evaluator that logs a heuristic evaluation.


from langsmith.evaluation import EvaluationResult, run_evaluator
from langsmith.schemas import Example, Run

@run_evaluator
def check_not_idk(run: Run, example: Example):
    """Illustration of a custom evaluator."""
    agent_response = run.outputs["output"]
    if "don't know" in agent_response or "not sure" in agent_response:
        score = 0
    else:
        score = 1
    # You can access the dataset labels in example.outputs[key]
    # You can also access the model inputs in run.inputs[key]
    return EvaluationResult(
        key="not_uncertain",
        score=score,
    )
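

As the comments note, example.outputs exposes the dataset labels. A companion sketch (hypothetical, for illustration) that scores whether the reference answer appears verbatim in the prediction:


@run_evaluator
def contains_reference(run: Run, example: Example):
    """Hypothetical evaluator: does the prediction contain the label?"""
    prediction = run.outputs["output"].lower()
    reference = example.outputs["output"].lower()
    return EvaluationResult(
        key="contains_reference",
        score=int(reference in prediction),
    )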


Below, the evaluators are configured, and the results are compared against the ground-truth labels.


Semantic similarity is measured using embedding distance, among other metrics.


from langchain.evaluation import EvaluatorType
from langchain.smith import RunEvalConfig
evaluation_config = RunEvalConfig(
    # Evaluators can either be an evaluator type (e.g., "qa", "criteria", "embedding_distance", etc.) or a configuration for that evaluator
    evaluators=[
        # Measures whether a QA response is "Correct", based on a reference answer
        # You can also select via the raw string "qa"
        EvaluatorType.QA,
        # Measure the embedding distance between the output and the reference answer
        # Equivalent to: RunEvalConfig.EmbeddingDistance(embeddings=OpenAIEmbeddings())
        EvaluatorType.EMBEDDING_DISTANCE,
        # Grade whether the output satisfies the stated criteria.
        # You can select a default one such as "helpfulness" or provide your own.
        RunEvalConfig.LabeledCriteria("helpfulness"),
        # The LabeledScoreString evaluator outputs a score on a scale from 1-10.
        # You can use default criteria or write your own rubric
        RunEvalConfig.LabeledScoreString(
            {
                "accuracy": """
Score 1: The answer is completely unrelated to the reference.
Score 3: The answer has minor relevance but does not align with the reference.
Score 5: The answer has moderate relevance but contains inaccuracies.
Score 7: The answer aligns with the reference but has minor errors or omissions.
Score 10: The answer is completely accurate and aligns perfectly with the reference."""
            },
            normalize_by=10,
        ),
    ],
    # You can add custom StringEvaluator or RunEvaluator objects here as well, which will automatically be
    # applied to each prediction. Check out the docs for examples.
    custom_evaluators=[check_not_idk],
)


The prompt is imported from the LangSmith prompt hub:


from langchain import hub
# We will test this version of the prompt
prompt = hub.pull("wfh/langsmith-agent-prompt:798e7324")
print(prompt)


Run the evaluation: example rows are fetched from the specified dataset.


The agent is run on each example, and the evaluators are applied to the resulting traces, generating feedback automatically.


The results are visible in LangSmith.


import functools
from langchain.smith import run_on_dataset
chain_results = run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=functools.partial(
        create_agent, prompt=prompt, llm_with_tools=llm_with_tools
    ),
    evaluation=evaluation_config,
    verbose=True,
    client=client,
    project_name=f"runnable-agent-test-5d466cbc-{unique_id}",
    # Project metadata communicates the experiment parameters,
    # Useful for reviewing the test results
    project_metadata={
        "env": "testing-notebook",
        "model": "gpt-3.5-turbo",
        "prompt": "5d466cbc",
    },
)
# Sometimes, the agent will error due to parsing issues, incompatible tool inputs, etc.
# These are logged as warnings here and captured as errors in the tracing UI.
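

The feedback logged by the evaluators can also be read back through the client rather than the web UI (a sketch using the same project name passed to run_on_dataset above):


# Print each evaluator's score for every run in the test project
test_project = f"runnable-agent-test-798e7324-{unique_id}"  # must match above
for run in client.list_runs(project_name=test_project):
    for feedback in client.list_feedback(run_ids=[run.id]):
        print(run.name, feedback.key, feedback.score)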


In LangSmith, the inputs, the example reference outputs, and the test results can all be viewed.


[Screenshot: test results for the dataset in LangSmith]


Inspectability


Note in the image below that links are created when the evaluation runs in the notebook. The first link opens the evaluation results, and all tests for the dataset can be viewed as well.


[Screenshot: links to the evaluation results printed in the notebook]


Source: https://medium.com/@cobusgreyling/langsmith-by-langchain-ce1742d44f24