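Imagine being able to analyze more than 150,000 datasets through a simple chat interface. With Vanna.AI and DuckDB, that becomes straightforward.
Vanna.AI provides a way to build text-to-SQL pipelines with any LLM, while DuckDB lets you connect to external data sources such as Hugging Face.
Connecting to Vanna AI
To connect to Vanna, first get a free Vanna API key. Vanna can connect to any LLM, but this article uses GPT-4o. Here is how to connect using an OpenAI API key.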
from vanna.openai import OpenAI_Chat
from vanna.vannadb import VannaDB_VectorStore

class MyVanna(VannaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        # Placeholder: your model name from https://vanna.ai/account/profile
        MY_VANNA_MODEL = "your-model-name"
        # Placeholder: your Vanna API key
        MY_VANNA_API_KEY = "your-vanna-api-key"
        VannaDB_VectorStore.__init__(self, vanna_model=MY_VANNA_MODEL, vanna_api_key=MY_VANNA_API_KEY, config=config)
        OpenAI_Chat.__init__(self, config=config)

# Add your OpenAI api_key
vn = MyVanna(config={'api_key': 'sk-...', 'model': 'gpt-4o'})
Connecting to DuckDB
DuckDB lets you download datasets directly from Hugging Face. The first step is to connect to DuckDB.
# Connect to a DuckDB database (replace <token> with your own MotherDuck token)
vn.connect_to_duckdb(url='motherduck:?motherduck_token=<token>')
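Once connected, start by exploring the datasets available on Hugging Face.
Downloading the dataset
You can pick any dataset you like for this walkthrough; this article uses the FineWeb dataset.
To load a dataset into your environment, pass a dataset reference of this form: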
hf://datasets/⟨my_username⟩/⟨my_dataset⟩/⟨path_to_file⟩
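The reference format above can also be assembled in code. The helper below is a minimal sketch: `hf_dataset_path` is a hypothetical name, not part of DuckDB or Vanna, and it only builds the string that DuckDB expects.

```python
def hf_dataset_path(username: str, dataset: str, path_to_file: str) -> str:
    """Build a reference of the form hf://datasets/<username>/<dataset>/<path_to_file>."""
    # Strip stray slashes so the pieces join cleanly.
    parts = [username.strip("/"), dataset.strip("/"), path_to_file.strip("/")]
    if not all(parts):
        raise ValueError("username, dataset and path_to_file must all be non-empty")
    return "hf://datasets/" + "/".join(parts)


# The FineWeb file used in this article.
fineweb_path = hf_dataset_path(
    "HuggingFaceFW", "fineweb", "data/CC-MAIN-2013-20/000_00000.parquet"
)
print(fineweb_path)
```

The resulting string can be placed inside single quotes in a `SELECT * FROM '...'` statement.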
You can run this SQL to download the dataset.
# Running this will download the dataset
vn.run_sql("""
CREATE TABLE Fineweb AS
SELECT * FROM 'hf://datasets/HuggingFaceFW/fineweb/data/CC-MAIN-2013-20/000_00000.parquet';
SELECT * FROM Fineweb""")
Training
Vanna helps you build a RAG application that understands your database schema.
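To make the RAG idea concrete, here is a toy sketch of the retrieve-then-prompt pattern. It is not Vanna's implementation: real systems typically retrieve with embeddings from a vector store, whereas this sketch uses plain word overlap, and the training items (including the column names) are illustrative assumptions.

```python
import re

# Illustrative training items; the schema and wording are assumptions for this sketch.
training_items = [
    {"kind": "ddl", "text": "CREATE TABLE Fineweb (text VARCHAR, url VARCHAR)"},
    {"kind": "documentation",
     "text": "Each row in Fineweb is one web page; the text column holds the page content."},
    {"kind": "sql", "text": "Q: How many rows are in Fineweb? A: SELECT COUNT(*) FROM Fineweb"},
]


def tokenize(s):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9_]+", s.lower()))


def retrieve(question, items, k=2):
    """Return the k items sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(items, key=lambda item: len(q & tokenize(item["text"])), reverse=True)
    return ranked[:k]


def build_prompt(question, items, k=2):
    """Put the retrieved context ahead of the question, as a RAG prompt would."""
    context = "\n".join(f"[{item['kind']}] {item['text']}" for item in retrieve(question, items, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nSQL:"


print(build_prompt("What are the most common words in text?", training_items))
```

The point of the pattern is that the LLM only sees the few pieces of schema, documentation, and example SQL that look relevant to the question, which is why the training steps that follow matter.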
Training on a plan
# The information schema query may need some tweaking depending on your database. This is a good starting point.
df_information_schema = vn.run_sql("SELECT * FROM INFORMATION_SCHEMA.COLUMNS")
# This will break up the information schema into bite-sized chunks that can be referenced by the LLM
plan = vn.get_training_plan_generic(df_information_schema)
plan
# If you are happy with the plan, run this to train on it
vn.train(plan=plan)
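As a rough illustration of what "bite-sized chunks" can mean, the sketch below groups information-schema rows by table and renders one short text chunk per table. This is an assumption about the general idea, not the logic of `get_training_plan_generic`; the rows and the `chunk_by_table` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows shaped like INFORMATION_SCHEMA.COLUMNS output (only a few fields shown).
rows = [
    {"table_schema": "main", "table_name": "Fineweb", "column_name": "text", "data_type": "VARCHAR"},
    {"table_schema": "main", "table_name": "Fineweb", "column_name": "url", "data_type": "VARCHAR"},
    {"table_schema": "main", "table_name": "Other", "column_name": "id", "data_type": "BIGINT"},
]


def chunk_by_table(rows):
    """Group column rows by (schema, table) and describe each table in one line."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["table_schema"], row["table_name"])
        grouped[key].append(f"{row['column_name']} {row['data_type']}")
    return [
        f"Table {schema}.{table} has columns: " + ", ".join(columns)
        for (schema, table), columns in grouped.items()
    ]


chunks = chunk_by_table(rows)
for chunk in chunks:
    print(chunk)
```

Each chunk is small enough to be retrieved on its own when a question mentions that table.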
Training on DDL
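# In DuckDB, DESCRIBE returns the column names and types of a table or query.
# Run it first and train on its output; passing the statement text alone would only store that literal string.
schema_df = vn.run_sql("DESCRIBE SELECT * FROM Fineweb")
vn.train(ddl=schema_df.to_string(index=False))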
Training on question/SQL pairs
# Here is an example of training on a SQL statement
# In this dataset we calculate the most common words in the text column
vn.train(sql="""
SELECT word, COUNT(*) AS frequency
FROM (
    SELECT UNNEST(STRING_SPLIT(LOWER(text), ' ')) AS word
    FROM Fineweb
) AS words
GROUP BY word
ORDER BY frequency DESC
LIMIT 10;
""", question="What are the most common words in text?")
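To see what that query computes, here is a small pure-Python equivalent run on made-up sample text. It is only a sketch for intuition: `most_common_words` is a hypothetical helper, and the sample strings are not taken from FineWeb.

```python
from collections import Counter


def most_common_words(texts, limit=10):
    """Count words across all texts and return the `limit` most frequent ones."""
    counts = Counter()
    for text in texts:
        # Mirrors STRING_SPLIT(LOWER(text), ' '): lowercase, then split on single spaces.
        counts.update(text.lower().split(" "))
    return counts.most_common(limit)


sample = ["The cat sat", "the dog sat", "The end"]
print(most_common_words(sample, limit=2))  # [('the', 3), ('sat', 2)]
```

As in the SQL version, splitting on a single space means punctuation stays attached to words and repeated spaces produce empty strings, so a production query would likely need extra cleaning.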
Training on documentation
# We can use documentation to give explicit context that you would give to a data analyst
vn.train(documentation="The number of worker's column corresponds to people laid off")
Chat
You can ask questions with Vanna's ask function or launch the built-in user interface.
vn.ask("Show me the most common words in text", visualize=False)
You can launch the Flask app with the following lines of code:
from vanna.flask import VannaFlaskApp
app = VannaFlaskApp(vn)
app.run()