Building "Auto-Analyst": An AI Agent System for Data Analytics

Published July 1, 2024 by alex

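I have been developing AI-driven agents to reduce my workload as a data scientist/analyst. Popular culture often depicts AI replacing human jobs, but in practice most AI agents are not substitutes for people. Instead, they help us work more efficiently. This agent is designed to do exactly that.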


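Design


[Figure: flowchart of the agent system]


The flowchart shows a system that starts from a user-defined goal. A planner agent then delegates tasks to a set of worker agents, each of which generates code for a specific part of the problem. Finally, a code combiner agent collects and integrates the individual code snippets into a single, coherent script that accomplishes the whole goal.


Note: The planner agent can delegate to a subset of the agents, not necessarily all of them. Also, each agent has its own set of inputs, which the diagram does not show.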
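
The flow described above can be sketched in plain Python before any LLM is involved. This is a minimal illustration, not the article's implementation: `run_pipeline`, the stub planner, the stub workers, and the stub combiner are all hypothetical stand-ins for the LLM-backed agents.

```python
from typing import Callable, Dict, List

def run_pipeline(
    goal: str,
    planner: Callable[[str, List[str]], List[str]],
    workers: Dict[str, Callable[[str], str]],
    combiner: Callable[[List[str]], str],
) -> str:
    """Planner picks worker names, each chosen worker emits code, the combiner merges it."""
    plan = planner(goal, sorted(workers))
    # The planner may select only a subset of the available workers
    snippets = [workers[name](goal) for name in plan if name in workers]
    return combiner(snippets)

def stub_planner(goal: str, available: List[str]) -> List[str]:
    # A real planner would ask an LLM; this stub always picks the same two agents
    wanted = ("preprocessing_agent", "data_viz_agent")
    return [name for name in wanted if name in available]

def stub_combiner(snippets: List[str]) -> str:
    # A real combiner would also repair errors; this stub only concatenates
    return "\n".join(snippets)

# Hypothetical workers: each maps a goal to a code snippet
workers = {
    "preprocessing_agent": lambda goal: "df = df.dropna()",
    "statistical_analytics_agent": lambda goal: "model = fit(df)",
    "data_viz_agent": lambda goal: "fig = plot(df)",
}

script = run_pipeline("Explore the data", stub_planner, workers, stub_combiner)
print(script)
```

The statistical worker is available but never called here, which mirrors the note that the planner does not have to use every agent.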


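Individual components

This article walks you through building the agent step by step and presents the code block for each individual component. The following section then shows how these components fit together.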


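Planner agent

The planner agent takes three inputs: the user-defined goal, the available datasets, and the agent descriptions. It outputs a plan in the following format:


Agent1->Agent2->Agent3....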


# You can use other orchestration libraries, but I found DSPy
# good for building quickly and simply, and for evaluation (which makes the application more reliable)
import dspy
# This object inherits from the dspy.Signature class
# The text inside """ is the prompt
class analytical_planner(dspy.Signature):
    """ You are data analytics planner agent. You have access to three inputs
    1. Datasets
    2. Data Agent descriptions
    3. User-defined Goal
    You take these three inputs to develop a comprehensive plan to achieve the user-defined goal from the data & Agents available.
    In case you think the user-defined goal is infeasible you can ask the user to redefine or add more description to the goal.
    Give your output in this format:
    plan: Agent1->Agent2->Agent3
    plan_desc = Use Agent 1 for this reason, then agent2 for this reason and lastly agent3 for this reason.
    You don't have to use all the agents in response of the query
    
    """
    # Input fields and their descriptions
    dataset = dspy.InputField(desc="Available datasets loaded in the system, use this df_name,columns  set df as copy of df_name")
    Agent_desc = dspy.InputField(desc="The agents available in the system")
    goal = dspy.InputField(desc="The user defined goal ")
    # Output fields and their descriptions
    plan = dspy.OutputField(desc="The plan that would achieve the user defined goal")
    plan_desc = dspy.OutputField(desc="The reasoning behind the chosen plan")
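
Because the planner returns its plan as a single string, something downstream has to turn that string into a list of agent names. The helper below is one hedged way to do it; `parse_plan` and `known_agents` are hypothetical names that do not appear in the original code. It tolerates a leading "plan:" label and drops any step that does not name a known agent.

```python
from typing import List, Set

def parse_plan(plan: str, known_agents: Set[str]) -> List[str]:
    """Split 'Agent1->Agent2' into steps, keeping only names of known agents."""
    text = plan.strip()
    # The prompt asks for "plan: ...", so the model may echo that label
    if text.lower().startswith("plan:"):
        text = text[len("plan:"):]
    steps = [step.strip() for step in text.split("->")]
    return [step for step in steps if step in known_agents]

known_agents = {"preprocessing_agent", "statistical_analytics_agent", "sk_learn_agent"}

good_plan = parse_plan(
    "plan: preprocessing_agent -> statistical_analytics_agent->made_up_agent",
    known_agents,
)
unclear_plan = parse_plan("Please add more detail to the goal", known_agents)
```

An empty list is a convenient signal that the planner could not produce a usable plan, for example because it asked the user to clarify the goal.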




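Analysis agents

Most analysis agents share a common structure and differ only slightly in their prompts. They take two inputs: the user-defined goal and the dataset index. They produce two outputs: the analysis code and a commentary, which can be used for debugging or redirecting the agent.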


# I define analysis agents as those agents that are in the middle-layer
# they produce code for a specialised data analysis task
class preprocessing_agent(dspy.Signature):
    """ You are a data pre-processing agent, your job is to take a user-defined goal and available dataset,
    to build an exploratory analytics pipeline. You do this by outputting the required Python code. 
    You will only use numpy and pandas, to perform pre-processing and introductory analysis
    """
    dataset = dspy.InputField(desc="Available datasets loaded in the system, use this df_name,columns  set df as copy of df_name")
    goal = dspy.InputField(desc="The user defined goal ")
    commentary = dspy.OutputField(desc="The comments about what analysis is being performed")
    code = dspy.OutputField(desc ="The code that does the data preprocessing and introductory analysis")
class statistical_analytics_agent(dspy.Signature):
    """ You are a statistical analytics agent. 
    Your task is to take a dataset and a user-defined goal, and output 
    Python code that performs the appropriate statistical analysis to achieve that goal.
    You should use the Python statsmodels library"""
    dataset = dspy.InputField(desc="Available datasets loaded in the system, use this df_name,columns  set df as copy of df_name")
    goal = dspy.InputField(desc="The user defined goal for the analysis to be performed")
    commentary = dspy.OutputField(desc="The comments about what analysis is being performed")
    code = dspy.OutputField(desc="The code that does the statistical analysis using statsmodels")
class sk_learn_agent(dspy.Signature):
    # Prompt
    """You are a machine learning agent. 
    Your task is to take a dataset and a user-defined goal, and output Python code that performs the appropriate machine learning analysis to achieve that goal. 
    You should use the scikit-learn library."""
    # Input fields
    dataset = dspy.InputField(desc="Available datasets loaded in the system, use this df_name,columns. set df as copy of df_name")
    goal = dspy.InputField(desc="The user defined goal ")
    # Output fields
    commentary = dspy.OutputField(desc="The comments about what analysis is being performed")
    code = dspy.OutputField(desc="The code that does the machine learning analysis using scikit-learn")
## I worked on the data-viz agent separately and already optimized it using DSPy.
## The only big difference is that this agent takes another input, the styling index




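Code combiner agent

The purpose of this agent is to consolidate the output of all agents into one coherent script. It takes a long list of code and outputs code.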


class code_combiner_agent(dspy.Signature):
    """ You are a code combiner agent, taking Python code output from many agents and combining the operations into 1 output
    You also fix any errors in the code"""
    agent_code_list = dspy.InputField(desc="A list of code given by each agent")
    refined_complete_code = dspy.OutputField(desc="Refined complete code base")
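
To make the combiner's job concrete, here is a deliberately naive, non-LLM sketch of the same idea: hoist top-level imports, remove duplicates, concatenate the rest, and check that the result still parses. `combine_snippets` is a hypothetical helper, not part of the article's system, and it assumes single-line import statements.

```python
import ast
from typing import List

def combine_snippets(snippets: List[str]) -> str:
    """Merge code snippets into one script with deduplicated imports at the top."""
    imports: List[str] = []
    body: List[str] = []
    for snippet in snippets:
        for raw_line in snippet.splitlines():
            line = raw_line.rstrip()
            # Only unindented, single-line import statements are hoisted
            if line.startswith(("import ", "from ")):
                if line not in imports:
                    imports.append(line)
            else:
                body.append(line)
        body.append("")  # blank line between snippets
    script = "\n".join(imports + [""] + body).rstrip() + "\n"
    ast.parse(script)  # raises SyntaxError if the merged script is not valid Python
    return script

snippets = [
    "import pandas as pd\ndf = pd.DataFrame({'a': [1, 2]})",
    "import pandas as pd\nimport numpy as np\ntotal = np.sum(df['a'])",
]
combined = combine_snippets(snippets)
print(combined)
```

The `ast.parse` call only checks syntax. It cannot catch the logical conflicts between snippets, such as clashing variable names, that the LLM-based agent is asked to fix.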


Optional agents/indexes

To make the agents work more smoothly and to catch some errors, I also built these additional agents and indexes.


# The same signature used in Data Viz agent post
class Data_Viz(dspy.Signature):
    """
    You are an AI agent who uses the goal to generate data visualizations in Plotly.
    You have to use the tools at your disposal
    {dataframe_index}
    {styling_index}
    You must give an output as code, in case there is no relevant columns, just state that you don't have the relevant information
    """
    goal = dspy.InputField(desc="user defined goal which includes information about data and chart they want to plot")
    dataframe_context = dspy.InputField(desc=" Provides information about the data in the data frame. Only use column names and dataframe_name as in this context")
    styling_context = dspy.InputField(desc='Provides instructions on how to style your Plotly plots')
    code= dspy.OutputField(desc="Plotly code that visualizes what the user needs according to the query & dataframe_index & styling_context")
# An optional agent that checks if the user-defined goal works well
class goal_refiner_agent(dspy.Signature):
    """You take a user-defined goal given to an AI data analyst planner agent, 
    you make the goal more elaborate using the datasets available and agent_desc"""
    dataset = dspy.InputField(desc="Available datasets loaded in the system, use this df_name,columns  set df as copy of df_name")
    Agent_desc = dspy.InputField(desc= "The agents available in the system")
    goal = dspy.InputField(desc="The user defined goal ")
    refined_goal = dspy.OutputField(desc='Refined goal that helps the planner agent plan better')


Rather than feeding in information about the entire dataset, I built a retriever that fetches information about the available data.


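# I chose a LlamaIndex-based retriever as it was more convenient.
# You can feed in your data in multiple ways, for example by
# providing descriptions of the column names and the dataframe reference,
# and also the purpose for which the data was collected.
# Import path assumes llama-index >= 0.10; older releases use `from llama_index import VectorStoreIndex`
from llama_index.core import VectorStoreIndex
# `docs` is assumed to be a list of llama-index Document objects prepared beforehand
dataframe_index = VectorStoreIndex.from_documents(docs)
# I also defined a styling index for the data visualization agent,
# which has natural-language instructions on how to style different visualizations.
# `styling_instructions` is likewise assumed to be a list of Document objects
style_index = VectorStoreIndex.from_documents(styling_instructions)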
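
To show what the retriever contributes without needing an embedding model, here is a toy stand-in that ranks dataset descriptions by word overlap with the query. This is not how a vector index scores documents; `tokenize`, `retrieve`, and the sample descriptions are hypothetical and only illustrate the idea of passing the top match to the agents instead of the whole dataset.

```python
import re
from typing import Dict, List, Set

def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of word-like tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def retrieve(query: str, descriptions: Dict[str, str], top_k: int = 1) -> List[str]:
    """Return the names of the top_k descriptions sharing the most words with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        descriptions,
        key=lambda name: len(query_tokens & tokenize(descriptions[name])),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical dataset descriptions, standing in for indexed documents
dataset_descriptions = {
    "crime_df": "Chicago crime records with columns date, primary_type, district, arrest",
    "weather_df": "Daily weather with columns date, temperature, rainfall",
}

best = retrieve("What is the cause of crime in Chicago?", dataset_descriptions)
print(best)
```

A real vector index matches on meaning rather than exact words, so it would also handle a query such as "offences in the city" that shares no tokens with the description.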


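Putting everything together as one system

In DSPy, to compile a complex LLM application you define a module with two essential methods: __init__ and forward.


The __init__ method initializes the module by defining all the variables that will be used throughout. The forward method is where the core functionality is implemented: it lays out how the output of one component interacts with the others, effectively driving the application's logic.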
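
The same two-method pattern can be illustrated without DSPy or an LLM. The sketch below is a plain-Python analogue under stated assumptions: `MiniAnalyst`, its stub agents, and the hard-coded context are hypothetical, and the plan is passed in directly rather than produced by a planner. It shows the part that matters for wiring: each agent receives only the inputs it declares.

```python
from typing import Callable, Dict, List

class MiniAnalyst:
    """Plain-Python analogue of the __init__/forward pattern (no LLM calls)."""

    def __init__(self, agents: Dict[str, Callable[..., str]], agent_inputs: Dict[str, List[str]]):
        # __init__ only stores the components and the inputs each one expects
        self.agents = agents
        self.agent_inputs = agent_inputs

    def forward(self, query: str, plan: str) -> List[str]:
        # Shared context; the values here are placeholders for retrieved text
        context = {
            "goal": query,
            "dataset": "crime_df: date, primary_type",
            "styling_index": "dark theme",
        }
        outputs = []
        for name in (step.strip() for step in plan.split("->")):
            if name not in self.agents:
                continue  # skip steps that name an unknown agent
            kwargs = {key: context[key] for key in self.agent_inputs[name]}
            outputs.append(self.agents[name](**kwargs))
        return outputs

    def __call__(self, *args, **kwargs):
        # As with dspy.Module, calling the instance runs forward
        return self.forward(*args, **kwargs)

def stats(goal: str, dataset: str) -> str:
    return f"# stats for {goal!r} on {dataset.split(':')[0]}"

def viz(goal: str, dataset: str, styling_index: str) -> str:
    return f"# plot with {styling_index}"

system = MiniAnalyst(
    agents={"stats_agent": stats, "viz_agent": viz},
    agent_inputs={
        "stats_agent": ["goal", "dataset"],
        "viz_agent": ["goal", "dataset", "styling_index"],
    },
)
result = system("crime trends", plan="stats_agent -> viz_agent -> unknown_agent")
```

Keeping the per-agent input names in a dictionary means a new agent with an extra input, such as the styling index, can be added without changing the loop.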


# This module takes only one input on initialization
class auto_analyst(dspy.Module):
    def __init__(self, agents):
        super().__init__()
        # Defines the available agents, their inputs, and descriptions
        self.agents = {}
        self.agent_inputs = {}
        self.agent_desc = []
        for a in agents:
            name = a.__pydantic_core_schema__['schema']['model_name']
            # Using CoT prompting, as from experience it helps generate better responses
            self.agents[name] = dspy.ChainOfThought(a)
            # Parses the input field names from the signature's string form (the part before '->')
            self.agent_inputs[name] = {x.strip() for x in str(a.__pydantic_core_schema__['cls']).split('->')[0].split('(')[1].split(',')}
            self.agent_desc.append(str(a.__pydantic_core_schema__['cls']))
        # Defining the planner, refine_goal & code combiner agents separately,
        # as they don't generate the code & analysis; they help in planning,
        # getting better goals & combining the code
        self.planner = dspy.ChainOfThought(analytical_planner)
        self.refine_goal = dspy.ChainOfThought(goal_refiner_agent)
        self.code_combiner_agent = dspy.ChainOfThought(code_combiner_agent)
        # These two retrievers are defined using llama-index retrievers;
        # you can customize this depending on how you want your agents to work
        self.dataset = dataframe_index.as_retriever(similarity_top_k=1)
        self.styling_index = style_index.as_retriever(similarity_top_k=1)

    def forward(self, query):
        # This dict is used to quickly pass arguments for agent inputs
        dict_ = {}
        # Retrieves the context relevant to the query
        dict_['dataset'] = self.dataset.retrieve(query)[0].text
        dict_['styling_index'] = self.styling_index.retrieve(query)[0].text
        dict_['goal'] = query
        dict_['Agent_desc'] = str(self.agent_desc)
        # Output dictionary that stores all agent outputs
        output_dict = {}
        # This comes up with the plan
        plan = self.planner(goal=dict_['goal'], dataset=dict_['dataset'], Agent_desc=dict_['Agent_desc'])
        output_dict['analytical_planner'] = plan
        code_list = []
        # If the planner worked as intended, it should give agent names separated by ->
        plan_list = [p.strip() for p in plan.plan.split('->') if p.strip() in self.agents]
        # In case the goal is unclear (the plan names no known agent), it is sent to the goal refiner agent.
        # Note: there is no retry limit here, so a goal that never yields a plan would recurse indefinitely
        if not plan_list:
            refined = self.refine_goal(dataset=dict_['dataset'], goal=dict_['goal'], Agent_desc=dict_['Agent_desc'])
            return self.forward(query=refined.refined_goal)
        # Passes the goal and other inputs to each agent in the plan
        for p in plan_list:
            inputs = {x: dict_[x] for x in self.agent_inputs[p]}
            output_dict[p] = self.agents[p](**inputs)
            # Creates a list of all the generated code, to be combined into one script
            code_list.append(output_dict[p].code)
        # Stores the last output
        output_dict['code_combiner_agent'] = self.code_combiner_agent(agent_code_list=str(code_list))

        return output_dict

# You can store all available agent signatures as a list.
# Note: data_viz_agent is not defined in this article; it is assumed to be a signature
# like Data_Viz whose inputs are goal, dataset and styling_index
agents = [preprocessing_agent, statistical_analytics_agent, sk_learn_agent, data_viz_agent]
# Define the agentic system
auto_analyst_system = auto_analyst(agents)
# The system is preloaded with Chicago crime data
goal = "What is the cause of crime in Chicago?"
# Asking the agentic system to perform analysis for this query
output = auto_analyst_system(query=goal)


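Now let's go through the results of the query step by step.


For this query: "What is the cause of crime in Chicago?"


The plan is executed, starting with the preprocessing agent.


Next, the statistical analytics agent.


Next, the Plotly data visualization agent.


Finally, the code combiner agent puts all the code together.


This is the output after executing the code from the last agent.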
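
Running the combined script is the step where generated code most often fails, so it helps to capture the error instead of letting it crash the session. The helper below is a minimal sketch; `run_generated_code` is a hypothetical name and not part of the article's system. Note that `exec` provides no sandboxing, so this should only be used on code you have reviewed.

```python
import traceback
from typing import Any, Dict, Optional, Tuple

def run_generated_code(code: str) -> Tuple[Dict[str, Any], Optional[str]]:
    """Execute code in a fresh namespace; return (namespace, traceback text or None)."""
    namespace: Dict[str, Any] = {}
    try:
        exec(compile(code, "<generated>", "exec"), namespace)
    except Exception:
        # Syntax errors and runtime errors both end up here
        return namespace, traceback.format_exc()
    return namespace, None

ok_namespace, ok_error = run_generated_code("total = sum(range(5))")
bad_namespace, bad_error = run_generated_code("value = 1 / 0")
```

The captured traceback text could be handed back to an agent as extra context when asking it to fix its own code.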




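As with many agents, it performs remarkably well when it works as intended. This is only the first iteration of a project I intend to improve over time.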

Source: https://medium.com/firebird-technologies/building-auto-analyst-a-data-analytics-ai-agentic-system-3ac2573dcaf0