Chat with 150K Hugging Face Datasets Using Vanna.AI and GPT-4o

Published July 3, 2024 by alex

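Imagine being able to analyze more than 150,000 datasets through a simple chat interface. Vanna.AI and DuckDB make this straightforward.

Vanna.AI provides a way to build text-to-SQL pipelines with any LLM, while DuckDB lets you connect to external data sources such as Hugging Face.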

Connecting to Vanna AI

To connect to Vanna, you first need to get a free Vanna API key. Vanna lets you connect to any LLM, but this article uses GPT-4o. Here is how to connect using an OpenAI API key.

from vanna.openai import OpenAI_Chat
from vanna.vannadb import VannaDB_VectorStore

class MyVanna(VannaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        # Placeholder: your model name from https://vanna.ai/account/profile
        MY_VANNA_MODEL = "my-model"
        # Placeholder: your Vanna API key
        MY_VANNA_API_KEY = "..."
        VannaDB_VectorStore.__init__(self, vanna_model=MY_VANNA_MODEL, vanna_api_key=MY_VANNA_API_KEY, config=config)
        OpenAI_Chat.__init__(self, config=config)

# Add your OpenAI api_key
vn = MyVanna(config={'api_key': 'sk-...', 'model': 'gpt-4o'})
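Hardcoding the key in the script is fine for a quick test, but it is easy to leak that way. Below is a minimal sketch of building the same config dict from an environment variable instead. The helper name `build_openai_config` and the variable name `OPENAI_API_KEY` are assumptions for illustration, not part of Vanna.

```python
import os


def build_openai_config(env=os.environ, model="gpt-4o"):
    """Return the config dict for MyVanna, reading the API key from env.

    `env` is any mapping; it defaults to the process environment.
    The variable name OPENAI_API_KEY is an assumption for this sketch.
    """
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first")
    return {"api_key": api_key, "model": model}


# Usage, assuming the MyVanna class defined above:
# vn = MyVanna(config=build_openai_config())
```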

Connecting to DuckDB

DuckDB lets you read datasets directly from Hugging Face. The first step is to connect to DuckDB.

# Connect to a DuckDB database (here one hosted on MotherDuck; replace <token> with your own token)
vn.connect_to_duckdb(url='motherduck:?motherduck_token=<token>')

Once connected, you can start exploring the datasets available on Hugging Face!

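Downloading a dataset

You can follow along with any dataset you like; this article uses the FineWeb dataset.

To load a dataset into your environment, pass a dataset reference of this form: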

hf://datasets/⟨my_username⟩/⟨my_dataset⟩/⟨path_to_file⟩
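As a small illustration of the pattern above, the sketch below assembles an `hf://` reference from its three parts. The helper name `hf_dataset_path` is hypothetical; the example values are the FineWeb file used in this article.

```python
def hf_dataset_path(username, dataset, path_to_file):
    """Join the three parts into hf://datasets/<username>/<dataset>/<path_to_file>."""
    parts = [username.strip("/"), dataset.strip("/"), path_to_file.lstrip("/")]
    return "hf://datasets/" + "/".join(parts)


# The FineWeb Parquet file used in this article
fineweb_file = hf_dataset_path(
    "HuggingFaceFW", "fineweb", "data/CC-MAIN-2013-20/000_00000.parquet"
)
print(fineweb_file)
```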

You can download the dataset by running this SQL.

# Running this loads one Parquet file of the FineWeb dataset into a table
vn.run_sql("""
CREATE TABLE Fineweb AS
SELECT * FROM 'hf://datasets/HuggingFaceFW/fineweb/data/CC-MAIN-2013-20/000_00000.parquet';
SELECT * FROM Fineweb""")

Training

Vanna helps you build a RAG application that understands your database schema.

Training plan

# The information schema query may need some tweaking depending on your database. This is a good starting point.
df_information_schema = vn.run_sql("SELECT * FROM INFORMATION_SCHEMA.COLUMNS")
# This will break up the information schema into bite-sized chunks that can be referenced by the LLM
plan = vn.get_training_plan_generic(df_information_schema)
plan
# If you are happy with the plan, run this to train on it
vn.train(plan=plan)

DDL training

# In duckDB the describe statement can fetch the DDL for any table
vn.train(ddl="DESCRIBE SELECT * FROM FineWeb")
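As written, the string handed to `vn.train` is the DESCRIBE query text rather than its result. If you would rather train on the actual column names and types, one option is to run DESCRIBE yourself and turn the rows into a CREATE TABLE statement. The sketch below shows only the conversion step. The helper name `describe_to_ddl` is hypothetical, and the three columns are an illustrative subset, not the full FineWeb schema.

```python
def describe_to_ddl(table_name, columns):
    """Build a CREATE TABLE statement from (column_name, column_type) pairs."""
    lines = [f'    "{name}" {col_type}' for name, col_type in columns]
    return f"CREATE TABLE {table_name} (\n" + ",\n".join(lines) + "\n);"


# Illustrative subset of columns; replace with the rows DESCRIBE returns
example_columns = [("text", "VARCHAR"), ("url", "VARCHAR"), ("token_count", "BIGINT")]
ddl = describe_to_ddl("Fineweb", example_columns)
print(ddl)

# With a live connection (assumption: the DESCRIBE result has
# column_name and column_type columns, as DuckDB normally returns):
# df = vn.run_sql("DESCRIBE Fineweb")
# vn.train(ddl=describe_to_ddl("Fineweb", zip(df["column_name"], df["column_type"])))
```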

Question/SQL pair training

# Here is an example of training on a question/SQL pair
# For this dataset we calculate the most common words in the text column
# of the Fineweb table created earlier
vn.train(sql="""
SELECT word, COUNT(*) AS frequency
FROM (
    SELECT UNNEST(STRING_SPLIT(LOWER(text), ' ')) AS word
    FROM Fineweb
) AS words
GROUP BY word
ORDER BY frequency DESC
LIMIT 10;
""", question="What are the most common words in text?")

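If you have several question/SQL pairs, it can be convenient to keep them in one list and train on them in a loop. This is a minimal sketch: the helper `train_pairs` and the two example pairs are made up for illustration, and it assumes only that the training function accepts `question=` and `sql=` keyword arguments, as `vn.train` does in the call above.

```python
def train_pairs(train, pairs):
    """Call train(question=..., sql=...) once per pair; return how many were sent."""
    count = 0
    for question, sql in pairs:
        train(question=question, sql=sql.strip())
        count += 1
    return count


# Hypothetical pairs against the Fineweb table and its text column
pairs = [
    ("How many rows are in Fineweb?",
     "SELECT COUNT(*) AS row_count FROM Fineweb;"),
    ("What is the average length of text?",
     "SELECT AVG(LENGTH(text)) AS avg_length FROM Fineweb;"),
]

# Usage with the vn object from earlier:
# train_pairs(vn.train, pairs)
```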
Documentation training

# We can use documentation to give explicit context that you would give to a data analyst
vn.train(documentation="The number of worker's column corresponds to people laid off")

Chat

You can ask questions with Vanna's ask function or launch the built-in user interface.

vn.ask("Show me the most common words in text", visualize=False)
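If you want to inspect the generated SQL before it runs, you can split the question into two steps using `generate_sql` and `run_sql`. The sketch below assumes only that the object passed in has those two methods; the wrapper name `ask_in_two_steps` is hypothetical.

```python
def ask_in_two_steps(vn, question):
    """Generate SQL for a question, print it, then run it and return both."""
    sql = vn.generate_sql(question)
    print(sql)  # review the query before relying on the result
    result = vn.run_sql(sql)
    return sql, result


# Usage with the vn object from earlier:
# sql, df = ask_in_two_steps(vn, "Show me the most common words in text")
```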

You can launch the Flask app with the following lines of code:

from vanna.flask import VannaFlaskApp
app = VannaFlaskApp(vn)
app.run()

Source: https://arslanshahid-1997.medium.com/chat-with-150k-huggingface-datasets-using-vanna-ai-with-gpt-4o-547b69659f52
