OpenAI 刚刚发布了其最新旗舰机型 GPT 4o。这款先进的模型具有多模态功能,能够处理图像、文本和视频输入,生成超越以往任何模型的响应。本文章重点介绍如何使用 GPT 4o 和 Vanna AI 构建文本到 SQL 管道,与数据库聊天。
入门
from vanna.openai import OpenAI_Chat
from vanna.vannadb import VannaDB_VectorStore
class MyVanna(VannaDB_VectorStore, OpenAI_Chat):
def __init__(self, config=None):
MY_VANNA_MODEL = # Your model name from https://vanna.ai/account/profile
VannaDB_VectorStore.__init__(self, vanna_model=MY_VANNA_MODEL, vanna_api_key=MY_VANNA_API_KEY, config=config)
OpenAI_Chat.__init__(self, config=config)
# Add your OpenAI api_key
vn = MyVanna(config={'api_key': 'sk-...', 'model': 'gpt-4o'})
连接数据库
Vanna 已为以下 8 个数据库内置了连接器(你只需额外编写几行代码即可连接到其他数据库):
在本篇文章中,我将连接到 DuckDB StackOverFlow 数据库。该数据库可在此处找到!
#This is how you can connect to a DuckDB database
vn.connect_to_duckdb(url='motherduck:[<database_name>]?motherduck_token=<token>&saas_mode=true')
训练
计划培训(信息模式)
# The information schema query may need some tweaking depending on your database. This is a good starting point.
df_information_schema = vn.run_sql("SELECT * FROM INFORMATION_SCHEMA.COLUMNS")
# This will break up the information schema into bite-sized chunks that can be referenced by the LLM
plan = vn.get_training_plan_generic(df_information_schema)
plan
# If you like the plan, then uncomment this and run it to train
vn.train(plan=plan)
DDL 培训
# In duckDB the describe statement can fetch the DDL for any table
vn.train(ddl="DESCRIBE SELECT * FROM Stackoverflow.users;")
SQL 语句培训
# here is an example of training on SQL statements
vn.train(
question="What are the top 10 users with highest amount of Badges?"
,sql="""SELECT UserId, COUNT(*) AS badge_count
FROM stackoverflow.main.badges
GROUP BY UserId
ORDER BY badge_count DESC
LIMIT 10
""")
# Another example
vn.train(
question="What is the difference in total answers for the user who answered the most answers and the user who answered the least questions?",
,sql="SELECT MAX(answer_count) - MIN(answer_count) AS difference
FROM (
SELECT OwnerUserId, COUNT(*) AS answer_count
FROM stackoverflow.main.posts
WHERE PostTypeId = 2
GROUP BY OwnerUserId
) AS answer_counts;
")
文档培训
# You can feed in contextual information using documentation
vn.train(documentation="We call the user with the highest answers in a year the Grand master")
你可以使用 vn.get_training_data() 查看培训数据。
# vn.ask is runs these following functions in sequence, which can be run individually
# 1. vn.generate_ql
# 2. vn.run_sql
# 3. vn.generate_plotly_code
# 4. vn.get_plotly_figure
# this is how you can ask Vanna question's post training
vn.ask('Find the top 10 users with the highest amount of Badges?')
使用 Flask 应用程序
Vanna 自带一个内置用户界面 Flask 应用程序。它可以在 jupyter 笔记本或 Python 脚本中启动。
from vanna.flask import VannaFlaskApp
app = VannaFlaskApp(vn)
app.run()
这就是如何使用 GPT 4o 进行文本到 SQL 转换。
基准测试
下面是 GPT 4o 与其他同类机型的比较,可以看出它的准确率最高,达到 61%,ChatGPT 4 turbo 为 59%,Claude Opus 为 56