Instruct HumanEval

Summary

InstructHumanEval is a modified version of OpenAI HumanEval. For a given prompt, we extracted its signature, its docstring and its header to create a flexible setting that allows the evaluation of instruction-tuned LLMs. The delimiters used in the instruction-tuning procedure can be used to build an instruction that elicits the model's best capabilities. Here is an example of use.

The prompt can be built as follows, depending on the model's instruction-tuning delimiters:

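from datasets import load_dataset

# Load the single "test" split of the dataset.
ds = load_dataset("codeparrot/instructhumaneval", split="test")
# "Human:" and "Assistant:" stand in for the model's own instruction-tuning delimiters.
prompt_0 = "Human:\n" + ds[0]["instruction"] + "\nAssistant:\n" + ds[0]["context"]
print(prompt_0)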

Output

Human:
Write a function has_close_elements(numbers: List[float], threshold: float) -> bool to solve the following problem:
Check if in given list of numbers, are any two numbers closer to each other than given threshold.
>>> has_close_elements([1.0, 2.0, 3.0], 0.5)
False
>>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
True 
Assistant:
from typing import List
def has_close_elements(numbers: List[float], threshold: float) -> bool:

The model can therefore complete the instruction, and it tends to yield better results because the prompt matches the format it was trained on.

You can also find the code to evaluate models on the dataset in the BigCode-evaluation-harness. The following sections provide more details on the dataset.

Dataset description

This dataset is a modified version of OpenAI HumanEval designed to adapt the benchmark to instruction fine-tuned models. The original HumanEval evaluates the ability to complete a function given its signature, its docstring and potentially some auxiliary functions.

Dataset construction

In order to build an instruction version of HumanEval, we extracted the relevant information from the prompt column of the original version:

  • signature : this is the signature of the function to complete. It looks like def function_name(args: type) -> return_type: .
  • docstring : this is the docstring of the function. It is the text which describes the purpose of the function.
  • context : this represents all the additional information provided to help the model complete the function. It includes the imports and the auxiliary functions.

Our idea was to move from the original format of HumanEval

<context>
<signature>
<docstring>

and build an instruction of the form

Write a function <signature> to solve the following problem:
<docstring>

From this instruction, we can design an evaluation pipeline for instruction fine-tuned language models.
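The construction above can be sketched as follows. This is only an illustration of one possible way to recover the fields with Python's standard ast module, not necessarily the procedure used to build the dataset. The helper names split_prompt and build_instruction are hypothetical, and the sketch assumes that context keeps the function header, as the example output in the summary suggests.

```python
import ast


def split_prompt(prompt: str, entry_point: str):
    """Split a HumanEval-style prompt into (context, signature, docstring)."""
    tree = ast.parse(prompt)
    # Locate the top-level function the model is asked to complete.
    func = next(
        node for node in tree.body
        if isinstance(node, ast.FunctionDef) and node.name == entry_point
    )
    lines = prompt.splitlines()
    header_start = func.lineno - 1        # index of the "def" line
    header_end = func.body[0].lineno - 1  # index of the first body line (the docstring)
    # Assumption: context = imports, auxiliary functions and the function header.
    context = "\n".join(lines[:header_end]).strip()
    # Signature = header without the leading "def" and the trailing colon.
    header = " ".join(line.strip() for line in lines[header_start:header_end])
    signature = header[len("def "):].rstrip(":").strip()
    docstring = ast.get_docstring(func) or ""
    return context, signature, docstring


def build_instruction(signature: str, docstring: str) -> str:
    """Fill the instruction template described above."""
    return f"Write a function {signature} to solve the following problem:\n{docstring}"
```

The sketch assumes the docstring starts on the line after the header, which is how HumanEval prompts are laid out. The exact whitespace normalization applied in the dataset may differ.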

Evaluation

Instruction fine-tuned LLMs are built by fine-tuning a base LLM on an instruction dataset. This instruction dataset contains several <input, output> pairs, where each pair represents an instruction submitted by a user together with the right answer to it. These pairs are framed into a multi-turn conversation with the help of special tokens which designate each member of the interaction, e.g. a user_token Human: , an assistant_token Assistant: and an end_token \n that designates the end of each turn.
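As a rough illustration of this framing, the sketch below joins <input, output> pairs into a single conversation string. The function name frame_conversation is hypothetical, and the default tokens are assumptions: each role token carries its own trailing newline so that the layout matches the example in the summary. A real model's delimiters should be used instead.

```python
from typing import Iterable, Tuple


def frame_conversation(
    pairs: Iterable[Tuple[str, str]],
    user_token: str = "Human:\n",          # assumed delimiter
    assistant_token: str = "Assistant:\n",  # assumed delimiter
    end_token: str = "\n",                  # assumed end-of-turn marker
) -> str:
    """Frame (instruction, answer) pairs as a multi-turn conversation."""
    turns = []
    for instruction, answer in pairs:
        # One user turn followed by one assistant turn, each closed by end_token.
        turns.append(user_token + instruction + end_token)
        turns.append(assistant_token + answer + end_token)
    return "".join(turns)
```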

Code completion

In this case, the LLM is provided with the following prompt

<user_token> + <instruction> + <end_token> + <assistant_token> + <context>

It is then expected to complete the function to solve the problem formulated by the instruction. This is very similar to the original evaluation, with the advantage that it puts the model in the best condition to understand the task it is asked to solve. The evaluation is done on the part generated after <assistant_token>.
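A minimal sketch of this setting is given below. The names build_completion_prompt and generated_part are hypothetical, and the default delimiter values are assumptions chosen to reproduce the example in the summary. The actual implementation in the evaluation harness may differ.

```python
def build_completion_prompt(
    instruction: str,
    context: str,
    user_token: str = "Human:\n",          # assumed delimiter
    assistant_token: str = "Assistant:\n",  # assumed delimiter
    end_token: str = "\n",                  # assumed end-of-turn marker
) -> str:
    """Build the code-completion prompt: the model continues after the context."""
    return user_token + instruction + end_token + assistant_token + context


def generated_part(full_output: str, assistant_token: str = "Assistant:\n") -> str:
    """Return everything that follows the last assistant token."""
    return full_output.rsplit(assistant_token, 1)[-1]
```

For a model that returns the prompt followed by its completion, generated_part yields the context plus the generated function body, which is the text that would then be run against the tests.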

Docstring to code

This setting is more complicated, as it requires the model to account for the information contained in the instruction, such as the function signature. The LLM is provided with the following prompt

<user_token> + <instruction> + <end_token> + <assistant_token>

The model has to generate a function with the correct signature that adequately solves the problem. The evaluation is done by identifying the content of the function in the generation (by searching for the right entry_point / function_name) and concatenating it with the <context> provided.
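The post-processing described above might look like the sketch below. It is a simplified illustration and not the harness's actual code: the helper names are hypothetical, the function header is assumed to fit on one line starting at column 0, and context is assumed to end with the function header, as in the example output of the summary.

```python
import re


def extract_function_body(generation: str, entry_point: str) -> str:
    """Return the indented body that follows `def <entry_point>(` in the generation."""
    lines = generation.splitlines()
    pattern = re.compile(rf"^def\s+{re.escape(entry_point)}\s*\(")
    # Index of the line defining the target function, or None if it is absent.
    start = next((i for i, line in enumerate(lines) if pattern.match(line)), None)
    if start is None:
        return ""
    body = []
    for line in lines[start + 1:]:
        # Stop at the first non-empty line that is back at column 0
        # (another definition, a closing code fence, trailing chat text).
        if line.strip() and not line[0].isspace():
            break
        body.append(line)
    return "\n".join(body).rstrip()


def assemble_program(context: str, generation: str, entry_point: str) -> str:
    """Concatenate the provided context with the extracted function body."""
    return context.rstrip() + "\n" + extract_function_body(generation, entry_point) + "\n"
```

Searching for the entry point makes the extraction tolerant of explanatory text or code fences around the function, which instruction-tuned models often produce.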

How to use the dataset

from datasets import load_dataset

ds = load_dataset("codeparrot/instructhumaneval")
ds
DatasetDict({
    test: Dataset({
        features: ['task_id', 'prompt', 'canonical_solution', 'test', 'entry_point', 'signature', 'docstring', 'context', 'instruction'],
        num_rows: 164
    })
})