Build Your Own PandasAI with LlamaIndex

Published by alex on September 4, 2023

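Introduction

Pandas AI is a Python library that uses generative AI to augment pandas, the popular data analysis library. With a simple prompt, Pandas AI lets you perform complex data cleaning, analysis, and visualization that previously required many lines of code.

Beyond crunching numbers, Pandas AI also understands natural language. You can ask questions about your data in plain English, and it will provide summaries and insights in everyday language, so you do not have to decipher complex graphs and tables.

In the example below, we provide a pandas data frame and ask the generative AI to create a bar chart. The result is impressive.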

pandas_ai.run(df, prompt='Plot the bar chart of type of media for each year release, using different colors.')

[Figure: bar chart of media type by release year, generated by Pandas AI]

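In this article, we will use LlamaIndex to create a similar tool that can understand a pandas data frame and produce complex results like the one shown above.

LlamaIndex supports natural language querying of data through chat and agents. It lets large language models interpret private data at scale without retraining on the new data, and it integrates large language models with a variety of data sources and tools. LlamaIndex is a data framework that makes it easy to build a chat-with-PDF application in just a few lines of code.

Setup

You can install the Python library with pip.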

pip install llama-index

By default, LlamaIndex uses the OpenAI gpt-3.5-turbo model for text generation and text-embedding-ada-002 for retrieval and embeddings. To run the code without extra configuration, we have to set OPENAI_API_KEY.

import os
os.environ["OPENAI_API_KEY"] = "sk-xxxxxx"
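Hard-coding the key as above is fine for a quick test, but it is easy to leak it by sharing the notebook. The following is a minimal sketch of an alternative that reuses a key already present in the environment and only prompts when it is missing. The helper name `ensure_openai_key` and its parameters are hypothetical and not part of LlamaIndex or the OpenAI client.

```python
import os
from getpass import getpass


def ensure_openai_key(env=os.environ, ask=getpass):
    """Return the OpenAI key from `env`, asking for it once if it is not set.

    `env` and `ask` are parameters only so the helper can be tried out
    without touching the real environment (hypothetical helper).
    """
    key = env.get("OPENAI_API_KEY")
    if not key:
        key = ask("OpenAI API key: ")
        env["OPENAI_API_KEY"] = key
    return key


# Demonstration with a throwaway dict and a fake prompt instead of real input.
demo_env = {}
ensure_openai_key(demo_env, ask=lambda prompt: "sk-xxxxxx")
print(demo_env)
```

In a real session you would call `ensure_openai_key()` with no arguments before creating the query engine.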

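LlamaIndex also supports integrations with Anthropic, Hugging Face, PaLM, and other models.

Pandas Query Engine

Let's get to the main topic: creating our own Pandas AI. After installing the library and setting the API key, we will create a simple data frame of cities with the city name and population as columns.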

import pandas as pd
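# Import path as of the llama-index release used in this article (2023);
# newer releases have moved this class, so check the docs for your version.
from llama_index.query_engine.pandas_query_engine import PandasQueryEngine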

df = pd.DataFrame(
    {"city": ["New York", "Islamabad", "Mumbai"], "population": [8804190, 1009832, 12478447]}
)

Using PandasQueryEngine, we will create a query engine that wraps the data frame.

After that, we will write a query and display the response.

query_engine = PandasQueryEngine(df=df, verbose=True)  # verbose=True prints the generated pandas code
response = query_engine.query(
    "What is the city with the lowest population?",
)

As we can see, it generated Python code that finds the city with the lowest population in the data frame.

> Pandas Instructions:
```
eval("df.loc[df['population'].idxmin()]['city']")
```
eval("df.loc[df['population'].idxmin()]['city']")
> Pandas Output: Islamabad

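If you print the response, you get "Islamabad". It is simple but impressive: you do not have to work out the logic yourself or experiment with code. Just type a question and you get the answer.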

print(response)

Islamabad

You can also print the code behind the result using the response metadata.

print(response.metadata["pandas_instruction_str"])

eval("df.loc[df['population'].idxmin()]['city']")
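To see what the engine actually did, the instruction string printed above can be evaluated by hand against the same data frame. This is only a sketch of the idea (a pandas expression evaluated with `df` in scope), not LlamaIndex's internal implementation.

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["New York", "Islamabad", "Mumbai"], "population": [8804190, 1009832, 12478447]}
)

# The pandas expression the model produced, copied from the metadata above.
instruction = "df.loc[df['population'].idxmin()]['city']"

# Evaluate it with only `df` exposed as a name.
result = eval(instruction, {"df": df})
print(result)  # Islamabad
```

Because the expression comes from a language model and is executed as code, it is worth reading the instruction string before trusting the result on data that matters.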

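Global YouTube Statistics Analysis

In the second example, we will load the Global YouTube Statistics 2023 dataset from Kaggle and perform some basic analysis. It goes a step beyond the simple example.

We will load the dataset with read_csv and pass it to the query engine. Then we will write a prompt that shows only the columns with missing values and the number of missing values in each.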

df_yt = pd.read_csv("Global YouTube Statistics.csv")
query_engine = PandasQueryEngine(df=df_yt, verbose=True)
response = query_engine.query(
    "List the columns with missing values and the number of missing values. Only show missing values columns.",
)

> Pandas Instructions:
```
df.isnull().sum()[df.isnull().sum() > 0]
```
df.isnull().sum()[df.isnull().sum() > 0]
> Pandas Output: category                                    46
Country                                    122
Abbreviation                               122
channel_type                                30
video_views_rank                             1
country_rank                               116
channel_type_rank                           33
video_views_for_the_last_30_days            56
subscribers_for_last_30_days               337
created_year                                 5
created_month                                5
created_date                                 5
Gross tertiary education enrollment (%)    123
Population                                 123
Unemployment rate                          123
Urban_population                           123
Latitude                                   123
Longitude                                  123
dtype: int64
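The generated expression counts nulls per column and then keeps only the non-zero counts. A small sketch on made-up data shows the same pattern without needing the Kaggle file. The `toy` frame and its values are invented for illustration.

```python
import pandas as pd

# Invented stand-in for the YouTube dataset, with a few missing values.
toy = pd.DataFrame(
    {
        "Youtuber": ["A", "B", "C", "D"],
        "category": ["Music", None, "Games", None],
        "Country": ["US", "IN", None, "BR"],
        "subscribers": [100, 90, 80, 70],
    }
)

missing = toy.isnull().sum()         # null count for every column
missing_only = missing[missing > 0]  # keep only columns that have nulls
print(missing_only)
```

Storing the counts in `missing` avoids computing `isnull().sum()` twice, as the generated one-liner does.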

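Now we will ask directly about the most popular channel type. In my experience, the LlamaIndex query engine is very accurate and has not produced any hallucinations so far.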

response = query_engine.query(
    "Which channel type have the most views.",
)

> Pandas Instructions:
```
eval("df.groupby('channel_type')['video views'].sum().idxmax()")
```
eval("df.groupby('channel_type')['video views'].sum().idxmax()")
> Pandas Output: Entertainment
Entertainment
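The instruction above is a plain group-and-sum followed by `idxmax()`. Here is a sketch of the same steps on invented numbers, keeping the column name `video views` (with a space) as it appears in the dataset.

```python
import pandas as pd

# Invented rows; only the column names follow the dataset.
toy = pd.DataFrame(
    {
        "channel_type": ["Music", "Entertainment", "Music", "Entertainment", "Games"],
        "video views": [10, 25, 5, 30, 20],
    }
)

views_by_type = toy.groupby("channel_type")["video views"].sum()  # total views per type
top_type = views_by_type.idxmax()                                 # label with the largest total
print(views_by_type)
print(top_type)  # Entertainment
```

Note that `groupby` drops rows whose key is missing by default, so channels without a `channel_type` do not count toward any total.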

Finally, we will ask it to visualize a bar chart, and the result is impressive.

response = query_engine.query(
    "Visualize barchat of top ten youtube channels based on subscribers and add the title.",
)

> Pandas Instructions:
```
eval("df.nlargest(10, 'subscribers')[['Youtuber', 'subscribers']].plot(kind='bar', x='Youtuber', y='subscribers', title='Top Ten YouTube Channels Based on Subscribers')")
```
eval("df.nlargest(10, 'subscribers')[['Youtuber', 'subscribers']].plot(kind='bar', x='Youtuber', y='subscribers', title='Top Ten YouTube Channels Based on Subscribers')")
> Pandas Output: AxesSubplot(0.125,0.11;0.775x0.77)

[Figure: bar chart of the top ten YouTube channels by subscribers]
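The engine's output is only the matplotlib axes object, so outside a notebook the chart still has to be shown or saved. The sketch below repeats the generated `nlargest(...).plot(...)` call on invented data and writes the figure to a file. The file name and the toy values are assumptions made for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import pandas as pd

# Invented channels and subscriber counts.
toy = pd.DataFrame(
    {
        "Youtuber": ["A", "B", "C", "D"],
        "subscribers": [50, 200, 120, 80],
    }
)

# Same pattern as the generated instruction, with 3 instead of 10 rows.
ax = toy.nlargest(3, "subscribers").plot(
    kind="bar",
    x="Youtuber",
    y="subscribers",
    title="Top Three Channels by Subscribers (toy data)",
)
ax.set_ylabel("subscribers")
ax.figure.tight_layout()
ax.figure.savefig("top_channels.png")  # hypothetical output path
```

In a Jupyter notebook the chart renders inline and the `savefig` call is optional.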

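With a simple prompt and the query engine, we can automate data analysis and carry out complex tasks. LlamaIndex has many other features as well.

Conclusion

In summary, LlamaIndex is an exciting new tool that lets developers create their own PandasAI, using the power of large language models for intuitive data analysis and conversation. By indexing and embedding your datasets with LlamaIndex, you can enable advanced natural language capabilities on your private data without compromising security or retraining the model.

With LlamaIndex, you can build Q&A over documents, chatbots, automated AI agents, knowledge graphs, AI SQL query engines, and full-stack web applications, as well as private generative AI applications.

Source: https://www.kdnuggets.com/build-your-own-pandasai-with-llamaindex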
