Dataset:
wellesley-easel/StudentEval
StudentEval is a dataset of 1,749 prompts for 48 problems, written by 80 students who had completed only a single semester of Python programming. To the best of our knowledge, it is the first dataset with multiple prompts per problem and per participant. For each problem-participant pair, we identify four key disjoint subsets:

- First Success
- First Failure
- Last Success
- Last Failure
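Since StudentEval is distributed as a Hugging Face dataset, it can be loaded with the `datasets` library. The sketch below is illustrative only: the boolean subset column name (`is_first_success`) is an assumption, so check the dataset's features for the exact schema.

```python
# Minimal sketch for loading StudentEval with the Hugging Face datasets library.
# Assumption: boolean subset columns such as "is_first_success" exist; consult
# the dataset features for the actual column names.
from datasets import load_dataset

dataset = load_dataset("wellesley-easel/StudentEval")
print(dataset)  # shows the available splits and columns

split = next(iter(dataset.values()))  # take the first split
first_success = split.filter(lambda row: row["is_first_success"])
print(len(first_success), "First Success prompts")
```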
During the experiment, we generated a single completion for each attempt. However, these prompts can also serve as a benchmark: by repeatedly sampling completions, we can compute pass@k for each subset (see the sketch below).
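For reference, repeated sampling is typically scored with the unbiased pass@k estimator of Chen et al. (2021): given n sampled completions of which c are correct, pass@k = 1 - C(n-c, k)/C(n, k). The snippet below is a generic sketch of that estimator, not code taken from the evaluation harness.

```python
# Generic sketch of the unbiased pass@k estimator (Chen et al., 2021);
# not taken from the StudentEval evaluation code.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one prompt with n samples, c correct."""
    if n - c < k:
        return 1.0  # every size-k subset of samples contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per prompt (as in the command below), 5 of them correct.
print(pass_at_k(n=20, c=5, k=1))  # 0.25
# A subset's pass@k is the mean of this estimate over its prompts.
```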
For the total_bill problem, we showed students the following signature and input/output examples (a reference-style solution sketch follows the table):
def total_bill(grocery_list, sales_tax):
| Input | Output |
|---|---|
| [['apples', 6, 0.99],['milk', 1, 1.49],['bread', 2, 3.50]], 0.07 | 15.44 |
| [['apples', 6, 0.99],['milk', 1, 1.49],['bread', 2, 3.50]], 0.0 | 14.43 |
| [['bread', 2, 3.50]], 0.5 | 10.5 |
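The dataset card does not include a reference solution; the sketch below simply reproduces the example outputs above, assuming each grocery_list entry is [name, quantity, unit_price] and that the total is rounded to two decimal places.

```python
# Illustrative solution that matches the example outputs; assumes entries are
# [name, quantity, unit_price] and that the total is rounded to two decimals.
def total_bill(grocery_list, sales_tax):
    subtotal = sum(quantity * price for _, quantity, price in grocery_list)
    return round(subtotal * (1 + sales_tax), 2)

assert total_bill([['apples', 6, 0.99], ['milk', 1, 1.49], ['bread', 2, 3.50]], 0.07) == 15.44
assert total_bill([['apples', 6, 0.99], ['milk', 1, 1.49], ['bread', 2, 3.50]], 0.0) == 14.43
assert total_bill([['bread', 2, 3.50]], 0.5) == 10.5
```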
Here are some examples of successful prompts from StudentEval:
Here are some examples of unsuccessful prompts:
The code for running StudentEval is based on the BigCode Evaluation Harness:
git clone https://github.com/arjunguha/bigcode-evaluation-harness/
After installing the harness's dependencies, run a command such as:
python3 main.py --model bigcode/gpt_bigcode-santacoder --tasks studenteval --max_length_generation 512 --n_samples 20 --batch_size 20 --precision bf16 --allow_code_execution
This will produce output similar to the following:
Selected Tasks: ['studenteval']
Loading tokenizer and model (in bf16)
100%|██████████| 1/1 [00:00<00:00, 519.93it/s]
100%|██████████| 1/1 [00:00<00:00, 680.12it/s]
number of problems for this task is 1027
100%|██████████| 1027/1027 [32:51<00:00, 1.92s/it]
generations were saved at generations.json
Evaluating generations...
100%|██████████| 20540/20540 [01:21<00:00, 252.84it/s]
{
  "studenteval": [
    {
      "group": "First Failure",
      "pass1": 0.022333333333333334
    },
    {
      "group": "First Success",
      "pass1": 0.3195187165775401
    },
    {
      "group": "Last Failure",
      "pass1": 0.02195121951219512
    },
    {
      "group": "Last Success",
      "pass1": 0.21405405405405406
    }
  ],
  "config": {
    "model": "bigcode/gpt_bigcode-santacoder",
    "temperature": 0.2,
    "n_samples": 20
  }
}
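If you want to post-process these numbers, the per-subset pass@1 values can be pulled straight out of the JSON. The sketch below assumes the output shown above has been saved to a file named results.json (the file name is illustrative, not something the command creates on its own).

```python
# Illustrative: parse the harness output above (saved manually as results.json)
# and print pass@1 for each StudentEval subset.
import json

with open("results.json") as f:
    results = json.load(f)

for entry in results["studenteval"]:
    print(f"{entry['group']}: pass@1 = {entry['pass1']:.3f}")
```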
This command used about 5GB of VRAM on an Ampere-series GPU. Generating the completions took roughly 30 minutes, and executing them on 8 cores took roughly another 10 minutes.
Paper: https://arxiv.org/abs/2306.04556
@misc{studenteval,
  title={StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code},
  author={Hannah McLean Babe and Sydney Nguyen and Yangtian Zi and Arjun Guha and Molly Q Feldman and Carolyn Jane Anderson},
  year={2023},
}