Dataset:
wellesley-easel/StudentEval
StudentEval is a dataset of 1,749 prompts for 48 problems, authored by 80 students who had completed only a one-semester Python programming course. To the best of our knowledge, it is the first dataset with multiple prompts per problem and multiple attempts by the same participant. For each problem-participant pair, we identify four key disjoint subsets of StudentEval:

- First Success: the participant's first prompt produced a passing completion
- First Failure: the participant's first prompt produced a failing completion, and they continued editing
- Last Success: the participant's final prompt produced a passing completion
- Last Failure: the participant's final prompt produced a failing completion, and they moved on to the next problem
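As a quick way to explore these subsets, the dataset can be loaded directly from the Hugging Face Hub. The sketch below is illustrative: the split handling and the boolean flag columns (`is_first_success`, etc.) are assumptions about the schema, so check the dataset viewer if they differ.

```python
from datasets import load_dataset

# Load StudentEval from the Hugging Face Hub.
dataset = load_dataset("wellesley-easel/StudentEval")
split = next(iter(dataset.values()))  # split name varies; take the first

# Subset flag column names are an assumption based on the four groups
# described above; adjust if the schema uses different names.
for flag in ("is_first_success", "is_first_failure",
             "is_last_success", "is_last_failure"):
    print(flag, sum(split[flag]))
```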
During the experiment, we produced one completion per attempt. However, these prompts can also be treated as a benchmark by repeatedly sampling completions and calculating pass@k rates for each subset.
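Concretely, pass@k is commonly estimated with the unbiased estimator of Chen et al. (2021): draw n completions per prompt, count the c that pass, and compute pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total completions sampled, c: completions that pass, k: budget.
    """
    if n - c < k:
        return 1.0
    # Product form avoids computing large binomial coefficients directly.
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# e.g., 20 samples for a prompt, 6 of which pass the tests:
print(pass_at_k(n=20, c=6, k=1))  # 0.3 (for k=1 this is the mean pass rate)
```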
For the total_bill problem, we showed students the following signature and input/output examples:
```python
def total_bill(grocery_list, sales_tax):
```
| Input | Output |
|---|---|
| [['apples', 6, 0.99],['milk', 1, 1.49],['bread', 2, 3.50]], 0.07 | 15.44 |
| [['apples', 6, 0.99],['milk', 1, 1.49],['bread', 2, 3.50]], 0.0 | 14.43 |
| [['bread', 2, 3.50]], 0.5 | 10.5 |
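For reference, one solution consistent with these examples (our illustrative implementation, not necessarily the benchmark's official one) sums quantity × unit price over the list, applies the tax, and rounds to two decimal places:

```python
def total_bill(grocery_list, sales_tax):
    # Each entry is [name, quantity, unit_price].
    subtotal = sum(quantity * price for _, quantity, price in grocery_list)
    return round(subtotal * (1 + sales_tax), 2)

# Reproduces the input/output examples above:
assert total_bill([['apples', 6, 0.99], ['milk', 1, 1.49], ['bread', 2, 3.50]], 0.07) == 15.44
assert total_bill([['apples', 6, 0.99], ['milk', 1, 1.49], ['bread', 2, 3.50]], 0.0) == 14.43
assert total_bill([['bread', 2, 3.50]], 0.5) == 10.5
```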
These are some examples of successful prompts in StudentEval:
And these are some examples of unsuccessful prompts:
The code to run StudentEval is based on the BigCode Evaluation Harness.
Download our branch of the BigCode Evaluation Harness:
```bash
git clone https://github.com/arjunguha/bigcode-evaluation-harness/
```
Install its dependencies (see the README file).
Run the studenteval task. The following command evaluates SantaCoder:
```bash
python3 main.py \
    --model bigcode/gpt_bigcode-santacoder \
    --tasks studenteval \
    --max_length_generation 512 \
    --n_samples 20 \
    --batch_size 20 \
    --precision bf16 \
    --allow_code_execution
```
It will produce output similar to the following:
```
Selected Tasks: ['studenteval']
Loading tokenizer and model (in bf16)
100%|██████████| 1/1 [00:00<00:00, 519.93it/s]
100%|██████████| 1/1 [00:00<00:00, 680.12it/s]
number of problems for this task is 1027
100%|██████████| 1027/1027 [32:51<00:00, 1.92s/it]
generations were saved at generations.json
Evaluating generations...
100%|██████████| 20540/20540 [01:21<00:00, 252.84it/s]
{
  "studenteval": [
    {
      "group": "First Failure",
      "pass1": 0.022333333333333334
    },
    {
      "group": "First Success",
      "pass1": 0.3195187165775401
    },
    {
      "group": "Last Failure",
      "pass1": 0.02195121951219512
    },
    {
      "group": "Last Success",
      "pass1": 0.21405405405405406
    }
  ],
  "config": {
    "model": "bigcode/gpt_bigcode-santacoder",
    "temperature": 0.2,
    "n_samples": 20
  }
}
```
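The same numbers can be read back programmatically. Assuming the metrics JSON shown above has been saved to a file (the filename below is an assumption; adjust it to wherever you saved the harness output), a few lines of Python extract the per-subset pass@1:

```python
import json

# Filename is an assumption; point this at your saved metrics file.
with open("evaluation_results.json") as f:
    results = json.load(f)

for entry in results["studenteval"]:
    print(f"{entry['group']:>15}: pass@1 = {entry['pass1']:.3f}")
```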
This command uses about 5 GB of VRAM on an Ampere-series GPU; it takes roughly 30 minutes to generate completions and 10 minutes to execute them on 8 cores.
Paper: https://arxiv.org/abs/2306.04556
```bibtex
@misc{studenteval,
  title={StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code},
  author={Hannah McLean Babe and Sydney Nguyen and Yangtian Zi and Arjun Guha and Molly Q Feldman and Carolyn Jane Anderson},
  year={2023},
  eprint={2306.04556},
  archivePrefix={arXiv},
}
```