Dataset:
wellesley-easel/StudentEval
StudentEval is a dataset of 1,749 prompts for 48 problems, authored by 80 students who had completed only a one-semester Python programming course. To the best of our knowledge, it is the first dataset with multiple prompts per problem and multiple attempts by the same participant. For each problem-participant pair, we identify four key disjoint subsets of StudentEval:

- First Success: the participant's first prompt produced a passing completion
- First Failure: the participant's first prompt produced a failing completion, and they continued editing
- Last Success: the participant's final prompt produced a passing completion
- Last Failure: the participant's final prompt produced a failing completion, and they moved on to the next problem
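As a quick way to explore these subsets, the dataset can be loaded directly from the Hugging Face Hub. The sketch below is illustrative: the split handling and the boolean flag columns (`is_first_success`, etc.) are assumptions about the schema, so check the dataset viewer if they differ.

```python
from datasets import load_dataset

# Load StudentEval from the Hugging Face Hub.
dataset = load_dataset("wellesley-easel/StudentEval")
split = next(iter(dataset.values()))  # split name varies; take the first

# Subset flag column names are an assumption based on the four groups
# described above; adjust if the schema uses different names.
for flag in ("is_first_success", "is_first_failure",
             "is_last_success", "is_last_failure"):
    print(flag, sum(split[flag]))
```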
During the experiment, we produced one completion per attempt. However, these prompts can also be treated as a benchmark by repeatedly sampling completions and calculating pass@k rates for each subset.
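Concretely, pass@k is commonly estimated with the unbiased estimator of Chen et al. (2021): draw n completions per prompt, count the c that pass, and compute pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total completions sampled, c: completions that pass, k: budget.
    """
    if n - c < k:
        return 1.0
    # Product form avoids computing large binomial coefficients directly.
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# e.g., 20 samples for a prompt, 6 of which pass the tests:
print(pass_at_k(n=20, c=6, k=1))  # 0.3 (for k=1 this is the mean pass rate)
```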
For the total_bill problem, we showed students the following signature and input/output examples:
```python
def total_bill(grocery_list, sales_tax):
```
| Input | Output |
|---|---|
| [['apples', 6, 0.99],['milk', 1, 1.49],['bread', 2, 3.50]], 0.07 | 15.44 |
| [['apples', 6, 0.99],['milk', 1, 1.49],['bread', 2, 3.50]], 0.0 | 14.43 |
| [['bread', 2, 3.50]], 0.5 | 10.5 |
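For reference, one solution consistent with these examples (our illustrative implementation, not necessarily the benchmark's official one) sums quantity × unit price over the list, applies the tax, and rounds to two decimal places:

```python
def total_bill(grocery_list, sales_tax):
    # Each entry is [name, quantity, unit_price].
    subtotal = sum(quantity * price for _, quantity, price in grocery_list)
    return round(subtotal * (1 + sales_tax), 2)

# Reproduces the input/output examples above:
assert total_bill([['apples', 6, 0.99], ['milk', 1, 1.49], ['bread', 2, 3.50]], 0.07) == 15.44
assert total_bill([['apples', 6, 0.99], ['milk', 1, 1.49], ['bread', 2, 3.50]], 0.0) == 14.43
assert total_bill([['bread', 2, 3.50]], 0.5) == 10.5
```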
These are some examples of successful prompts in StudentEval:
And these are some examples of unsuccessful prompts:
The code to run StudentEval is based on the BigCode Evaluation Harness.
Download our branch of the BigCode Evaluation Harness:
```bash
git clone https://github.com/arjunguha/bigcode-evaluation-harness/
```
Install its dependencies (see the README file).
Run the studenteval task. The following command evaluates SantaCoder:
```bash
python3 main.py \
    --model bigcode/gpt_bigcode-santacoder \
    --tasks studenteval \
    --max_length_generation 512 \
    --n_samples 20 \
    --batch_size 20 \
    --precision bf16 \
    --allow_code_execution
```
It will produce output similar to the following:
```
Selected Tasks: ['studenteval']
Loading tokenizer and model (in bf16)
100%|██████████| 1/1 [00:00<00:00, 519.93it/s]
100%|██████████| 1/1 [00:00<00:00, 680.12it/s]
number of problems for this task is 1027
100%|██████████| 1027/1027 [32:51<00:00, 1.92s/it]
generations were saved at generations.json
Evaluating generations...
100%|██████████| 20540/20540 [01:21<00:00, 252.84it/s]
{
  "studenteval": [
    {
      "group": "First Failure",
      "pass1": 0.022333333333333334
    },
    {
      "group": "First Success",
      "pass1": 0.3195187165775401
    },
    {
      "group": "Last Failure",
      "pass1": 0.02195121951219512
    },
    {
      "group": "Last Success",
      "pass1": 0.21405405405405406
    }
  ],
  "config": {
    "model": "bigcode/gpt_bigcode-santacoder",
    "temperature": 0.2,
    "n_samples": 20
  }
}
```
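The same numbers can be read back programmatically. Assuming the metrics JSON shown above has been saved to a file (the filename below is an assumption; adjust it to wherever you saved the harness output), a few lines of Python extract the per-subset pass@1:

```python
import json

# Filename is an assumption; point this at your saved metrics file.
with open("evaluation_results.json") as f:
    results = json.load(f)

for entry in results["studenteval"]:
    print(f"{entry['group']:>15}: pass@1 = {entry['pass1']:.3f}")
```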
This command uses about 5 GB of VRAM on an Ampere-series GPU; it takes roughly 30 minutes to generate completions and 10 minutes to execute them on 8 cores.
Paper: https://arxiv.org/abs/2306.04556
```bibtex
@misc{studenteval,
  title={StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code},
  author={Hannah McLean Babe and Sydney Nguyen and Yangtian Zi and Arjun Guha and Molly Q Feldman and Carolyn Jane Anderson},
  year={2023},
  eprint={2306.04556},
  archivePrefix={arXiv},
}
```