Dataset:

unaidedelf87777/openapi-function-invocations-25k


Announcement

We are excited to announce that we are currently fine-tuning MPT-7b and MPT-30b, advanced AI models originally trained by MosaicML. We are leveraging our AI Function Invocation Dataset to enhance their programming capabilities and their ability to interpret and respond accurately to natural language prompts for function invocation.

These fine-tuned versions of MPT-7b and MPT-30b are set to be released soon. They are being tailored to provide more profound and intuitive interactions for AI-assisted programming tasks.

In addition, we have prepared JSONL files that are formatted for use with the MPT-30b, MPT-7b, and GPT-Neo-X tokenizers. Located in the data folder, these train and test sets are designed to assist you in your machine learning experiments and model evaluations.
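
If you want to peek inside one of these JSONL files directly, a minimal sketch along the following lines should work (the path data/train.jsonl is an assumption; adjust it to match the actual file names in the data folder):

import json

# Read a pre-tokenized JSONL file line by line; each line is one JSON record.
with open('data/train.jsonl', 'r', encoding='utf-8') as f:  # hypothetical path
    for i, line in enumerate(f):
        record = json.loads(line)
        print(record.keys())
        if i == 2:  # only peek at the first few records
            break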

We are eager to see the innovative ways in which you will use these resources and look forward to your feedback. Stay tuned for more updates!

AI Function Invocation Dataset

Welcome to our AI Function Invocation Dataset, a synthetically constructed collection designed to teach AI models how to correctly invoke functions based on natural language prompts. This dataset is particularly effective when used with models that already possess a solid understanding of programming concepts and principles.

Dataset Construction

The construction of this dataset involved a systematic procedure combining manual extraction and AI-assisted synthesis. The function definitions used in this dataset were derived from OpenAPI (formerly Swagger) API specifications. OpenAPI is a specification for machine-readable interface files for describing, producing, consuming, and visualizing RESTful web services. We procured the API specifications from APIsGuru, an open-source project that collects and shares machine-readable API definitions from around the web.
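
To give a concrete picture of that extraction step, here is an illustrative sketch of how an OpenAPI specification can be reduced to simple function definitions. This is not the exact pipeline used to build the dataset; the spec path and output field names are assumptions.

import json

def extract_function_definitions(spec_path):
    # Load a JSON-formatted OpenAPI spec and turn each operation into a
    # minimal function-definition dict (name, description, parameters).
    with open(spec_path, 'r', encoding='utf-8') as f:
        spec = json.load(f)
    definitions = []
    for path, methods in spec.get('paths', {}).items():
        for method, operation in methods.items():
            if not isinstance(operation, dict):
                continue
            definitions.append({
                'name': operation.get('operationId', f'{method}_{path}'),
                'description': operation.get('summary') or operation.get('description', ''),
                'parameters': operation.get('parameters', []),
            })
    return definitions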

Following the extraction of the function definitions, we leveraged the GPT-3.5-turbo model from OpenAI to generate a series of responses based on the given function definitions. The generation process was guided by a predefined prompt, which you can find here. This prompt served to instruct the AI on the manner and structure of its responses.
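
As a rough sketch of the generation step, the call to GPT-3.5-turbo might look like the following. The guiding prompt below is a placeholder for the actual predefined prompt linked above, and the snippet assumes the pre-1.0 openai Python SDK.

import openai  # assumes the pre-1.0 openai SDK interface

openai.api_key = 'YOUR_API_KEY'  # placeholder
GUIDING_PROMPT = '...'  # stand-in for the actual predefined prompt

def generate_entry(function_definition):
    # Ask GPT-3.5-turbo for a natural-language prompt, the matching function
    # call, a simulated response, and a final reply for one definition.
    completion = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[
            {'role': 'system', 'content': GUIDING_PROMPT},
            {'role': 'user', 'content': str(function_definition)},
        ],
    )
    return completion['choices'][0]['message']['content']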

The model's responses were then systematically formatted and included in the dataset as individual entries. Each entry consists of the function definition used, the generated natural language prompt, the function call inferred from the prompt, an example function response, and the model's anticipated reply based on this response.

Structure of the Dataset

Each entry in the dataset is a synthetic JSON object, with the following components:

  • function_definition_used: A reprint of the original function definition that the model used as a reference for generating the other components. This contains details about the function's name, description, and parameters.
  • Prompt_to_call_function: A natural language request designed by the model to imply the use of a specific function.
  • Function_call_from_model: The function call generated by the model based on the prompt.
  • function_response: A simulated API response based on the arguments from the function call.
  • message_from_model_based_on_function_response: The model's expected response to the user, based on the function response.

The dataset is distributed as a CSV file in which each line represents a unique function invocation scenario; it contains 25,000 examples in total.
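
For reference, a single entry can be pictured as a record of the following shape. The values here are illustrative placeholders, not taken from the dataset.

example_entry = {
    'function_definition_used': {
        'name': 'getWeather',  # placeholder function
        'description': 'Retrieve the current weather for a city.',
        'parameters': [{'name': 'city', 'type': 'string', 'required': True}],
    },
    'Prompt_to_call_function': "What's the weather like in Paris right now?",
    'Function_call_from_model': 'getWeather(city="Paris")',
    'function_response': '{"temperature": 18, "condition": "cloudy"}',
    'message_from_model_based_on_function_response': 'It is currently 18°C and cloudy in Paris.',
}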

Limitations

This dataset is designed to teach AI models how to correctly invoke functions based on natural language prompts. It does not guide the model on when or why to invoke a function, or on advanced scenarios like invoking multiple functions within a single prompt. It's important to consider these limitations when using this dataset.

How to Load the Dataset

To load the dataset, you can use the Hugging Face Datasets library. Here's a quick guide.

Firstly, install the Hugging Face Datasets library with pip:

pip install datasets

Then, use the load_dataset function to load the dataset.

from datasets import load_dataset

dataset = load_dataset('unaidedelf87777/openapi-function-invocations-25k')

You can then access the dataset through the dataset variable.
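
Once loaded, you can inspect the splits and individual rows as usual. The split name and column access below are assumptions based on the field list above; adjust them to the actual configuration of the hosted files.

print(dataset)  # shows the available splits and their sizes
train = dataset['train']  # assumes a 'train' split exists
print(train[0]['Prompt_to_call_function'])  # natural language request
print(train[0]['Function_call_from_model'])  # corresponding function call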

Citation

If you find this dataset useful in your research, please cite it as follows:

@misc{ai_function_invocation,
  author = {unaidedelf87777},
  title = {AI Function Invocation Dataset},
  year = {2023},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/unaidedelf87777/openapi-function-invocations-25k}},
  note = {Function definitions were extracted from OpenAPI specs provided by APIsGuru (https://github.com/APIs-guru/openapi-directory.git)}
}

We hope you find this dataset useful for your projects and research! Feel free to reach out with any questions or feedback.