Dataset:
wellesley-easel/StudentEval
StudentEval is a dataset of 1,749 prompts for 48 problems, written by 80 students who had completed only a single semester of Python programming. To our knowledge, it is the first dataset with multiple prompts per problem and per participant. For each problem-participant pair, we identify four key disjoint subsets: First Success, First Failure, Last Success, and Last Failure.
During the experiment, we generated a single completion per attempt. However, these prompts can also serve as a benchmark: by repeatedly sampling completions, we can compute the pass@k rate for each subset.
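Pass@k from n sampled completions is usually computed with the unbiased estimator introduced in the Codex paper (Chen et al., 2021). A minimal sketch of that estimator (the harness's own implementation may differ in detail):

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n sampled completions, of which c passed.

    Computes 1 - C(n-c, k) / C(n, k) in a numerically stable product form.
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a success.
        return 1.0
    estimate = 1.0
    for i in range(n - c + 1, n + 1):
        estimate *= 1.0 - k / i
    return 1.0 - estimate
```

For example, with n_samples 20 and 10 passing completions, `pass_at_k(20, 10, 1)` yields 0.5.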
For the total_bill problem, students were shown the following signature and input/output examples:

```python
def total_bill(grocery_list, sales_tax):
```
| Input | Output |
|---|---|
| `[['apples', 6, 0.99], ['milk', 1, 1.49], ['bread', 2, 3.50]], 0.07` | 15.44 |
| `[['apples', 6, 0.99], ['milk', 1, 1.49], ['bread', 2, 3.50]], 0.0` | 14.43 |
| `[['bread', 2, 3.50]], 0.5` | 10.5 |
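A reference solution consistent with the examples above might look like this sketch; note that rounding to two decimal places is inferred from the expected outputs rather than stated in the problem:

```python
def total_bill(grocery_list, sales_tax):
    """Total cost of a grocery list after sales tax.

    Each item is a [name, quantity, unit_price] triple; the tax rate is
    applied to the subtotal, and the result is rounded to two decimals
    (an assumption based on the example outputs).
    """
    subtotal = sum(quantity * unit_price for _, quantity, unit_price in grocery_list)
    return round(subtotal * (1 + sales_tax), 2)
```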
Here are some examples of successful prompts from StudentEval:

Here are some examples of unsuccessful prompts:
The code for running StudentEval is based on the BigCode Evaluation Harness.
```shell
git clone https://github.com/arjunguha/bigcode-evaluation-harness/
python3 main.py --model bigcode/gpt_bigcode-santacoder --tasks studenteval --max_length_generation 512 --n_samples 20 --batch_size 20 --precision bf16 --allow_code_execution
```
This will produce output similar to the following:
```
Selected Tasks: ['studenteval']
Loading tokenizer and model (in bf16)
number of problems for this task is 1027
1027/1027 [32:51<00:00, 1.92s/it]
generations were saved at generations.json
Evaluating generations...
20540/20540 [01:21<00:00, 252.84it/s]
{
  "studenteval": [
    { "group": "First Failure", "pass1": 0.022333333333333334 },
    { "group": "First Success", "pass1": 0.3195187165775401 },
    { "group": "Last Failure", "pass1": 0.02195121951219512 },
    { "group": "Last Success", "pass1": 0.21405405405405406 }
  ],
  "config": {
    "model": "bigcode/gpt_bigcode-santacoder",
    "temperature": 0.2,
    "n_samples": 20
  }
}
```
This command used 5GB of VRAM on an Ampere-series GPU. Generating completions took roughly 30 minutes, and executing the completions took about 10 minutes on 8 cores.
Paper: https://arxiv.org/abs/2306.04556
```bibtex
@misc{studenteval,
  title={StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code},
  author={Hannah McLean Babe and Sydney Nguyen and Yangtian Zi and Arjun Guha and Molly Q Feldman and Carolyn Jane Anderson},
  year={2023},
}
```