Dataset: tjaffri/wikisql-generate

WikiSQL Dataset (Reformatted for Generative Models)

This is the WikiSQL dataset (https://huggingface.co/datasets/wikisql), reformatted for direct use with text generation LLMs. The license and credits of the original dataset remain in place.

Specifically, the changes from standard WikiSQL are:

  • WikiSQL stores table details as dictionaries, but tools like LangChain and LlamaIndex build their prompts from a SQL DESCRIBE of the tables. This dataset includes that description as table_info.

  • SQL commands in WikiSQL that were not syntactically valid (e.g. due to unquoted identifiers) were removed. Specifically, we created in-memory SQLite tables from the SQL DESCRIBE of the tables, then ran the WikiSQL human-readable SQL query against these in-memory tables. Any query that threw an exception, for any reason, was discarded. The queries that ran without exceptions are included in this dataset as sql_cmd.

  • The SQL queries under sql_cmd were also pretty-printed with sqlparse (keywords capitalized, among other formatting) to make the SQL more standard and easier for smaller models to learn.
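
The validation step described above can be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: the helper name runs_without_error and the sample table and queries are hypothetical, and it assumes table_info contains a CREATE TABLE statement that SQLite can execute.

```python
import sqlite3


def runs_without_error(table_info: str, sql_cmd: str) -> bool:
    """Return True if sql_cmd executes against an in-memory table built from table_info.

    Hypothetical helper: assumes table_info is a CREATE TABLE statement.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(table_info)    # build the empty in-memory table
        conn.execute(sql_cmd).fetchall()  # run the candidate query
        return True
    except sqlite3.Error:
        # Any SQLite error (syntax error, unknown column, ...) means "discard".
        return False
    finally:
        conn.close()


# Hypothetical sample data, not taken from the dataset.
table_info = 'CREATE TABLE "players" ("Player" TEXT, "No." TEXT, "Position" TEXT);'
quoted_query = 'SELECT "Position" FROM "players" WHERE "No." = \'42\''
unquoted_query = "SELECT Position FROM players WHERE No. = 42"

kept = [q for q in (quoted_query, unquoted_query) if runs_without_error(table_info, q)]
```

Here the query with the unquoted identifier No. fails to parse and is dropped, while the quoted version is kept.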

Suggested Uses

This dataset may be used for the following purposes:

  • Combining SQL queries with text-based retrieval, using techniques like the LlamaIndex SQLAutoVectorQueryEngine.

  • Fine-tuning LLMs to generate SQL commands from natural language inputs, given the SQL DESCRIBE of tables and various rows. This is exactly the use case for the LangChain SQLChain, so once fine-tuned, these LLMs may be used directly with these chains, in theory with better results (not tried at the time of writing).

  • Few-shot prompt seeding of LLMs used to generate SQL commands from natural language inputs.
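
As an illustration of the few-shot use, the sketch below assembles a prompt from a handful of records. It is only one possible layout: the function build_few_shot_prompt, the "Question:" and "SQL:" labels, and the question field name are assumptions made for this example; only table_info and sql_cmd are field names stated above.

```python
def build_few_shot_prompt(examples, table_info, question):
    """Join solved examples and one open question into a single prompt string.

    Hypothetical helper: each example is a dict with table_info, question
    (assumed field name) and sql_cmd.
    """
    parts = []
    for ex in examples:
        parts.append(
            f"{ex['table_info']}\n"
            f"Question: {ex['question']}\n"
            f"SQL: {ex['sql_cmd']}\n"
        )
    # The final block is left open so the model completes the SQL.
    parts.append(f"{table_info}\nQuestion: {question}\nSQL:")
    return "\n".join(parts)


# Hypothetical records, not taken from the dataset.
examples = [
    {
        "table_info": 'CREATE TABLE "players" ("Player" TEXT, "Position" TEXT)',
        "question": "What position does Ann Lee play?",
        "sql_cmd": 'SELECT "Position" FROM "players" WHERE "Player" = \'Ann Lee\'',
    }
]

prompt = build_few_shot_prompt(
    examples,
    table_info='CREATE TABLE "teams" ("Team" TEXT, "City" TEXT)',
    question="Which city is the Hawks team from?",
)
```

The same record layout could also serve as the input/target pair for fine-tuning, with sql_cmd as the target text.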