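I have been building AI-powered agents to reduce my workload as a data scientist/analyst. Popular culture often depicts AI taking over human jobs, but in practice most AI agents are not replacements for people. Instead, they help us work more efficiently. This agent is designed to do exactly that.
Design
The flowchart shows a system that starts from a user-defined goal. A planner agent then delegates the task to a group of worker agents, each of which generates code for one specific part of the problem. Finally, a code combiner agent collects the individual code snippets and merges them into a single, coherent script that accomplishes the whole goal.
Note: the planner agent can delegate to a subset of the agents, not necessarily all of them. In addition, each agent has its own set of inputs, which the diagram does not show.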
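Individual Components
This post walks through building the agent step by step and presents the code for each individual component. The section after that shows how the components fit together.
Planner Agent
The planner agent takes three inputs: the user-defined goal, the available datasets, and the agent descriptions. It outputs a plan in the following format:
Agent1->Agent2->Agent3....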
# You can use other orchestration libraries, but I found DSPy good for
# building quickly, keeping things simple, and evaluation (which makes the application more reliable)
import dspy

# This object inherits from the dspy.Signature class
# The text inside """ is the prompt
class analytical_planner(dspy.Signature):
    """ You are a data analytics planner agent. You have access to three inputs
    1. Datasets
    2. Data Agent descriptions
    3. User-defined Goal
    You take these three inputs to develop a comprehensive plan to achieve the user-defined goal from the data & Agents available.
    In case you think the user-defined goal is infeasible you can ask the user to redefine or add more description to the goal.
    Give your output in this format:
    plan: Agent1->Agent2->Agent3
    plan_desc = Use Agent 1 for this reason, then agent2 for this reason and lastly agent3 for this reason.
    You don't have to use all the agents in response to the query
    """
    # Input fields and their descriptions
    dataset = dspy.InputField(desc="Available datasets loaded in the system, use this df_name and columns; set df as a copy of df_name")
    Agent_desc = dspy.InputField(desc="The agents available in the system")
    goal = dspy.InputField(desc="The user defined goal")
    # Output fields and their descriptions
    plan = dspy.OutputField(desc="The plan that would achieve the user defined goal")
    plan_desc = dspy.OutputField(desc="The reasoning behind the chosen plan")
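The plan string has to be turned back into a list of agent names before anything can be dispatched. The following is a minimal sketch of that step in plain Python, not part of the original code: `parse_plan` and the `known` set are hypothetical names introduced for illustration, and the handling of a leading `plan:` label is an assumption based on the output format requested in the prompt.

```python
def parse_plan(plan_text, known_agents):
    """Split 'Agent1->Agent2->Agent3' into names and list the unknown ones."""
    text = plan_text.strip()
    # Assumption: the model may echo the "plan:" label from the prompt format.
    if text.lower().startswith("plan:"):
        text = text[len("plan:"):]
    names = [part.strip() for part in text.split("->") if part.strip()]
    unknown = [name for name in names if name not in known_agents]
    return names, unknown


# Hypothetical set of agent names available to the planner
known = {"preprocessing_agent", "statistical_analytics_agent", "sk_learn_agent"}
names, unknown = parse_plan("plan: preprocessing_agent -> sk_learn_agent", known)
print(names)    # the agents to run, in order
print(unknown)  # names the planner produced that do not exist
```

A non-empty `unknown` list is one possible signal that the plan should be rejected or the goal refined before any worker agent runs.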
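Analysis Agents
Most analysis agents share a common structure and differ only slightly in their prompts. They take two inputs: the user-defined goal and the dataset index. They produce two outputs: the analysis code and a commentary, which can be used to debug or redirect the agent.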
# I define analysis agents as the agents in the middle layer;
# they produce code for a specialised data analysis task
class preprocessing_agent(dspy.Signature):
    """ You are a data pre-processing agent, your job is to take a user-defined goal and available dataset,
    to build an exploratory analytics pipeline. You do this by outputting the required Python code.
    You will only use numpy and pandas, to perform pre-processing and introductory analysis
    """
    dataset = dspy.InputField(desc="Available datasets loaded in the system, use this df_name and columns; set df as a copy of df_name")
    goal = dspy.InputField(desc="The user defined goal")
    commentary = dspy.OutputField(desc="The comments about what analysis is being performed")
    code = dspy.OutputField(desc="The code that does the data preprocessing and introductory analysis")

class statistical_analytics_agent(dspy.Signature):
    """ You are a statistical analytics agent.
    Your task is to take a dataset and a user-defined goal, and output
    Python code that performs the appropriate statistical analysis to achieve that goal.
    You should use the Python statsmodels library"""
    dataset = dspy.InputField(desc="Available datasets loaded in the system, use this df_name and columns; set df as a copy of df_name")
    goal = dspy.InputField(desc="The user defined goal for the analysis to be performed")
    commentary = dspy.OutputField(desc="The comments about what analysis is being performed")
    code = dspy.OutputField(desc="The code that does the statistical analysis using statsmodels")

class sk_learn_agent(dspy.Signature):
    # Prompt
    """You are a machine learning agent.
    Your task is to take a dataset and a user-defined goal, and output Python code that performs the appropriate machine learning analysis to achieve that goal.
    You should use the scikit-learn library."""
    # Input fields
    dataset = dspy.InputField(desc="Available datasets loaded in the system, use this df_name and columns; set df as a copy of df_name")
    goal = dspy.InputField(desc="The user defined goal")
    # Output fields
    commentary = dspy.OutputField(desc="The comments about what analysis is being performed")
    code = dspy.OutputField(desc="The code that does the machine learning analysis using scikit-learn")

## I had already built the data-viz agent and optimized it using DSPy.
## The only big difference is that this agent takes one more input, the styling index
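Code Combiner Agent
The purpose of this agent is to consolidate the output of all the other agents into one coherent script. It takes in a long list of code snippets and outputs a single piece of code.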
class code_combiner_agent(dspy.Signature):
    """ You are a code combiner agent, taking Python code output from many agents and combining the operations into 1 output
    You also fix any errors in the code"""
    agent_code_list = dspy.InputField(desc="A list of code given by each agent")
    refined_complete_code = dspy.OutputField(desc="Refined complete code base")
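To make "combining" concrete, here is a rough, non-LLM sketch of what merging snippets into one script involves. It is not the author's method (the agent above delegates this to the model); `combine_snippets` is a hypothetical helper that hoists single-line top-level imports, removes duplicates, and uses `ast.parse` to confirm the merged result is at least syntactically valid Python. Multi-line parenthesized imports are not handled.

```python
import ast


def combine_snippets(snippets):
    """Hoist top-level import lines (deduplicated) and join the remaining code."""
    imports, body = [], []
    for snippet in snippets:
        for line in snippet.splitlines():
            # Only unindented, single-line imports are hoisted.
            if line.startswith(("import ", "from ")):
                if line.strip() not in imports:
                    imports.append(line.strip())
            else:
                body.append(line)
        body.append("")  # blank line between snippets
    combined = "\n".join(imports + [""] + body).rstrip() + "\n"
    ast.parse(combined)  # raises SyntaxError if the merged script is invalid
    return combined


# Hypothetical snippets standing in for two agents' `code` outputs
snippets = [
    "import pandas as pd\ndf = pd.DataFrame({'a': [1, 2]})",
    "import pandas as pd\nimport numpy as np\ntotal = np.sum(df['a'])",
]
combined = combine_snippets(snippets)
print(combined)
```

A syntax check like this only catches malformed code; it says nothing about whether variables defined by one agent are the ones the next agent expects, which is the part the LLM-based combiner is meant to reconcile.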
Optional Agents/Indexes
To make the agents run more smoothly and to catch some errors, I also built the following additional agents and indexes.
# The same signature used in the Data Viz agent post
class Data_Viz(dspy.Signature):
    """
    You are an AI agent who uses the goal to generate data visualizations in Plotly.
    You have to use the tools at your disposal
    {dataframe_index}
    {styling_index}
    You must give an output as code, in case there are no relevant columns, just state that you don't have the relevant information
    """
    goal = dspy.InputField(desc="user defined goal which includes information about data and chart they want to plot")
    dataframe_context = dspy.InputField(desc="Provides information about the data in the data frame. Only use column names and dataframe_name as in this context")
    styling_context = dspy.InputField(desc="Provides instructions on how to style your Plotly plots")
    code = dspy.OutputField(desc="Plotly code that visualizes what the user needs according to the query & dataframe_index & styling_context")

# An optional agent that checks if the user-defined goal works well
class goal_refiner_agent(dspy.Signature):
    """You take a user-defined goal given to an AI data analyst planner agent,
    you make the goal more elaborate using the datasets available and agent_desc"""
    dataset = dspy.InputField(desc="Available datasets loaded in the system, use this df_name and columns; set df as a copy of df_name")
    Agent_desc = dspy.InputField(desc="The agents available in the system")
    goal = dspy.InputField(desc="The user defined goal")
    refined_goal = dspy.OutputField(desc="Refined goal that helps the planner agent plan better")
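Rather than feeding in information about every dataset, I built a retriever that returns only the information about the available data that is relevant to the query.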
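# I chose a LlamaIndex-based retriever as it was more convenient.
# You can feed in your data in multiple ways:
# descriptions of the column names, the dataframe reference,
# the purpose the data was collected for, etc.
# (import path for llama-index >= 0.10; older releases use `from llama_index import VectorStoreIndex`)
from llama_index.core import VectorStoreIndex

# `docs` is a list of llama-index Document objects describing the data, prepared beforehand
dataframe_index = VectorStoreIndex.from_documents(docs)
# I also defined a styling index for the data visualization agent,
# which holds natural language instructions on how to style different visualizations.
# `styling_instructions` is likewise a list of Document objects prepared beforehand
style_index = VectorStoreIndex.from_documents(styling_instructions)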
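The idea behind the retriever, returning only the single most relevant dataset description for a query, can be illustrated without a vector store. The sketch below is a toy stand-in that ranks descriptions by word overlap; it is not how LlamaIndex scores documents (that uses embeddings), and `retrieve_top_1`, the dataframe names, and the column names are all hypothetical.

```python
import re


def retrieve_top_1(query, descriptions):
    """Return the description that shares the most words with the query."""
    query_words = set(re.findall(r"[a-z_]+", query.lower()))

    def overlap(text):
        return len(query_words & set(re.findall(r"[a-z_]+", text.lower())))

    return max(descriptions, key=overlap)


# Hypothetical dataset descriptions of the kind that would be indexed
descriptions = [
    "df_name: crime_df columns: date, primary_type, district, arrest (Chicago crime reports)",
    "df_name: sales_df columns: order_id, amount, region (retail sales orders)",
]
context = retrieve_top_1("What is the cause of crime in Chicago?", descriptions)
print(context)
```

Only the selected description is then passed to the agents as their `dataset` input, which keeps prompts short when many dataframes are loaded.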
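Putting It All Together as a System
In DSPy, building a complex LLM application means defining a module with two essential methods: __init__ and forward.
The __init__ method initializes the module by defining all the variables used throughout it. The forward method is where the core functionality lives: it lays out how the output of one component feeds into the others, and so drives the logic of the application.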
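The division of labor between the two methods can be seen in a dependency-free sketch. `TinyModule` below is a hypothetical stand-in, not `dspy.Module`; it only mirrors the convention that state is set up in `__init__`, the logic lives in `forward`, and calling the object runs `forward`.

```python
class TinyModule:
    """Stand-in for the __init__/forward pattern (not a dspy.Module)."""

    def __init__(self, steps):
        # __init__ only stores what forward will need later
        self.steps = steps

    def forward(self, query):
        # forward chains the components: each step consumes the previous output
        result = query
        for step in self.steps:
            result = step(result)
        return result

    def __call__(self, **kwargs):
        # Calling the module delegates to forward
        return self.forward(**kwargs)


pipeline = TinyModule([str.strip, str.lower])
print(pipeline(query="  What Drives Crime?  "))
```

In the real system the steps are LLM-backed components rather than string functions, but the flow of one output into the next input is the same.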
# This module takes only one input at initialization: the list of agent signatures
class auto_analyst(dspy.Module):
    def __init__(self, agents):
        super().__init__()
        # Defines the available agents, their inputs, and descriptions
        self.agents = {}
        self.agent_inputs = {}
        self.agent_desc = []
        for a in agents:
            name = a.__pydantic_core_schema__['schema']['model_name']
            # Using CoT prompting, as from experience it helps generate better responses
            self.agents[name] = dspy.ChainOfThought(a)
            # Parses the input field names out of the signature's string form
            self.agent_inputs[name] = {x.strip() for x in str(a.__pydantic_core_schema__['cls']).split('->')[0].split('(')[1].split(',')}
            self.agent_desc.append(str(a.__pydantic_core_schema__['cls']))
        # Defining the planner, refine_goal & code combiner agents separately,
        # as they don't generate code & analysis; they help with planning,
        # getting better goals & combining the code
        self.planner = dspy.ChainOfThought(analytical_planner)
        self.refine_goal = dspy.ChainOfThought(goal_refiner_agent)
        self.code_combiner_agent = dspy.ChainOfThought(code_combiner_agent)
        # These two retrievers are defined using llama-index retrievers;
        # you can customize this depending on how you want your agents to work
        self.dataset = dataframe_index.as_retriever(similarity_top_k=1)
        self.styling_index = style_index.as_retriever(similarity_top_k=1)

    def forward(self, query):
        # This dict is used to quickly pass arguments for agent inputs
        dict_ = {}
        # Retrieves the context relevant to the query
        dict_['dataset'] = self.dataset.retrieve(query)[0].text
        dict_['styling_index'] = self.styling_index.retrieve(query)[0].text
        dict_['goal'] = query
        dict_['Agent_desc'] = str(self.agent_desc)
        # Output dictionary that stores all agent outputs
        output_dict = {}
        # This comes up with the plan
        plan = self.planner(goal=dict_['goal'], dataset=dict_['dataset'], Agent_desc=dict_['Agent_desc'])
        output_dict['analytical_planner'] = plan
        code_list = []
        # If the planner worked as intended, it gives known agent names separated by ->
        plan_list = [p.strip() for p in plan.plan.split('->')]
        if not all(p in self.agents for p in plan_list):
            # In case the goal is unclear, send it to the goal refiner agent and plan again
            # (note: this retry is not bounded)
            refined = self.refine_goal(dataset=dict_['dataset'], goal=dict_['goal'], Agent_desc=dict_['Agent_desc'])
            return self.forward(query=refined.refined_goal)
        # Passes the goal and other inputs to each agent in the plan
        for p in plan_list:
            inputs = {x: dict_[x] for x in self.agent_inputs[p]}
            output_dict[p] = self.agents[p](**inputs)
            # Creates a list of all the generated code, to be combined into 1 script
            code_list.append(output_dict[p].code)
        # Stores the last output
        output_dict['code_combiner_agent'] = self.code_combiner_agent(agent_code_list=str(code_list))
        return output_dict
# You can store all available agent signatures as a list.
# data_viz_agent is the data-viz signature mentioned earlier; its input field names
# must match the keys built in forward (goal, dataset, styling_index)
agents = [preprocessing_agent, statistical_analytics_agent, sk_learn_agent, data_viz_agent]
# Define the agentic system
auto_analyst_system = auto_analyst(agents)
# The system is preloaded with Chicago crime data
goal = "What is the cause of crime in Chicago?"
# Asking the agentic system to perform analysis for this query
output = auto_analyst_system(query=goal)
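The dispatch step inside forward, where each agent in the plan receives only the inputs it declares, can be dry-run without an LLM. The sketch below replaces the agents with stubs; `make_stub_agent`, `run_plan`, and the placeholder context values are hypothetical, and the input sets are assumptions that mirror the field names used in this post.

```python
from types import SimpleNamespace


def make_stub_agent(name):
    """Return a fake agent that echoes which inputs it received."""
    def agent(**inputs):
        return SimpleNamespace(
            commentary=f"{name} received {sorted(inputs)}",
            code=f"# code from {name}",
        )
    return agent


# Assumed input sets, keyed by agent name
agent_inputs = {
    "preprocessing_agent": {"dataset", "goal"},
    "data_viz_agent": {"dataset", "goal", "styling_index"},
}
agents = {name: make_stub_agent(name) for name in agent_inputs}

# Shared context, standing in for dict_ in forward
context = {
    "dataset": "crime_df description",
    "goal": "What is the cause of crime in Chicago?",
    "styling_index": "styling notes",
    "Agent_desc": "agent descriptions",
}


def run_plan(plan_text):
    """Run each agent named in the plan with only the inputs it declares."""
    code_list = []
    for name in (p.strip() for p in plan_text.split("->")):
        inputs = {key: context[key] for key in agent_inputs[name]}
        code_list.append(agents[name](**inputs).code)
    return code_list


print(run_plan("preprocessing_agent->data_viz_agent"))
```

A dry run like this is a cheap way to check that every agent's declared input names exist in the shared context before spending tokens on real calls.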
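Now let's look at the results of the query step by step.
For the query "What is the cause of crime in Chicago?":
The plan is executed, starting with the preprocessing agent.
Next, the statistical analytics agent.
Next, the Plotly data visualization agent.
Finally, the code combiner agent, which merges all the code together.
This is the output after executing the code from the last agent.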
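The post does not show how the combined code is executed. One possible way, sketched here under the assumption that the generated code is trusted or runs in a sandbox, is to `exec` the string in a fresh namespace and capture what it prints. `run_generated_code` is a hypothetical helper, not part of the original system.

```python
import contextlib
import io


def run_generated_code(code):
    """Execute a code string in a fresh namespace; return (stdout, error)."""
    namespace = {}
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            # Warning: exec runs arbitrary code; only use on trusted or sandboxed input
            exec(code, namespace)
    except Exception as exc:
        return buffer.getvalue(), f"{type(exc).__name__}: {exc}"
    return buffer.getvalue(), None


out, err = run_generated_code("print(1 + 1)")
print(repr(out), err)
```

Returning the error text instead of raising makes it possible to hand a failure back to an agent for another attempt.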
As with many agents, it performs very well when it works as intended. This is only the first iteration of a project I plan to improve over time.