Dataset:

codeparrot/apps

Task:

Text generation

Subtask:

language-modeling

Language:

code

Multilinguality:

monolingual

Size:

size_categories:unknown

Language creators:

crowdsourced expert-generated

ArXiv:

arxiv:2105.09938 arxiv:2203.07814

License:

mit

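APPS Dataset

Dataset Description

APPS is a code generation benchmark with 10,000 problems. It can be used to evaluate the ability of language models to generate code from natural language specifications. The APPS metric is available on the Hub at codeparrot/apps_metric.

Languages

The dataset contains problems in English and code solutions in Python.

Dataset Structure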

from datasets import load_dataset
load_dataset("codeparrot/apps")

DatasetDict({
    train: Dataset({
        features: ['problem_id', 'question', 'solutions', 'input_output', 'difficulty', 'url', 'starter_code'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['problem_id', 'question', 'solutions', 'input_output', 'difficulty', 'url', 'starter_code'],
        num_rows: 5000
    })
})

How to Use

You can load the train split of the dataset and iterate over it with the following code:

from datasets import load_dataset
import json

ds = load_dataset("codeparrot/apps", split="train")
sample = next(iter(ds))
# non-empty solutions and input_output features can be parsed from text format this way:
sample["solutions"] = json.loads(sample["solutions"])
sample["input_output"] = json.loads(sample["input_output"])
print(sample)

#OUTPUT:
{
 'problem_id': 0,
 'question': 'Polycarp has $n$ different binary words. A word called binary if it contains only characters \'0\' and \'1\'. For example...',
 'solutions': ["for _ in range(int(input())):\n    n = int(input())\n    mass = []\n    zo = 0\n    oz = 0\n    zz = 0\n    oo = 0\n...",...],
 'input_output': {'inputs': ['4\n4\n0001\n1000\n0011\n0111\n3\n010\n101\n0\n2\n00000\n00001\n4\n01\n001\n0001\n00001\n'], 
                  'outputs': ['1\n3 \n-1\n0\n\n2\n1 2 \n']},
 'difficulty': 'interview',
 'url': 'https://codeforces.com/problemset/problem/1259/D',
 'starter_code': ''
}

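Each sample consists of a programming problem statement in English, some ground-truth Python solutions, test cases defined by their inputs and outputs (plus the function name, if provided), and metadata about the difficulty level and source of the problem.

If a sample has a non-empty input_output feature, you can read it as a dictionary with the keys inputs and outputs, and fn_name if it exists. Similarly, you can parse solutions into a list of solutions, as shown in the code above.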
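The parsing step above applies only to non-empty fields, since json.loads raises an error on an empty string. A minimal sketch of a tolerant parser follows. The helper name parse_sample and the example record are hypothetical, and the sketch assumes that a missing field is stored as an empty string.

```python
import json


def parse_sample(sample):
    """Return a copy of `sample` with its JSON-encoded fields decoded.

    Assumption: a missing field is stored as an empty string, so it is
    mapped to an empty list (solutions) or an empty dict (input_output).
    """
    parsed = dict(sample)
    parsed["solutions"] = json.loads(sample["solutions"]) if sample["solutions"] else []
    parsed["input_output"] = (
        json.loads(sample["input_output"]) if sample["input_output"] else {}
    )
    return parsed


# Hypothetical record shaped like a dataset row (not real APPS data).
raw = {
    "problem_id": 0,
    "solutions": json.dumps(["print(int(input()) * 2)"]),
    "input_output": json.dumps({"inputs": ["2\n"], "outputs": ["4\n"]}),
}
parsed = parse_sample(raw)
io = parsed["input_output"]
fn_name = io.get("fn_name")  # None when no function name is given
print(len(parsed["solutions"]), io["inputs"], fn_name)
```

Using dict.get for fn_name avoids a KeyError on the many samples that do not specify a function name.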

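You can also filter the dataset by difficulty level: introductory, interview, and competition. Pass the levels you want as a list to the difficulties argument. For example, to get the most challenging problems, select the competition level: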

ds = load_dataset("codeparrot/apps", split="train", difficulties=["competition"])
print(next(iter(ds))["question"])

#OUTPUT:
"""\
Codefortia is a small island country located somewhere in the West Pacific. It consists of $n$ settlements connected by
...

For each settlement $p = 1, 2, \dots, n$, can you tell what is the minimum time required to travel between the king's residence and the parliament house (located in settlement $p$) after some roads are abandoned?

-----Input-----

The first line of the input contains four integers $n$, $m$, $a$ and $b$ 
...

-----Output-----

Output a single line containing $n$ integers
...

-----Examples-----
Input
5 5 20 25
1 2 25
...

Output
0 25 60 40 20
...
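The same selection can also be sketched on rows that are already loaded, by comparing the difficulty field directly. The helper name filter_by_difficulty and the sample rows below are hypothetical and only illustrate the idea.

```python
def filter_by_difficulty(rows, levels):
    """Keep the rows whose 'difficulty' field is one of `levels`."""
    wanted = set(levels)
    return [row for row in rows if row["difficulty"] in wanted]


# Hypothetical rows; real rows carry the full set of fields.
rows = [
    {"problem_id": 0, "difficulty": "interview"},
    {"problem_id": 1, "difficulty": "competition"},
    {"problem_id": 2, "difficulty": "introductory"},
]
hardest = filter_by_difficulty(rows, ["competition"])
print([row["problem_id"] for row in hardest])  # prints [1]
```

On a loaded Dataset object, the same predicate can be passed to Dataset.filter, for example ds.filter(lambda row: row["difficulty"] == "competition").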

Data Fields

Field	Type	Description
problem_id	int	problem id
question	string	problem description
solutions	string	JSON string encoding a list of Python solutions
input_output	string	JSON string with the "inputs" and "outputs" of the test cases; may also include "fn_name", the name of the function to call
difficulty	string	difficulty level of the problem
url	string	url of the source of the problem
starter_code	string	starter code to include in prompts

Note that only a few samples specify fn_name and starter_code.
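Since starter_code is meant to be included in prompts and fn_name is often absent, prompt construction has to handle both cases. The sketch below shows one possible layout. The helper name build_prompt, the wording of the prompt, and the example row are all assumptions, not the format used by any particular evaluation harness.

```python
import json


def build_prompt(sample):
    """Assemble a text prompt from a row (hypothetical layout)."""
    io = json.loads(sample["input_output"]) if sample["input_output"] else {}
    # Rows with fn_name expect a function; the others read standard input.
    style = "call-based" if io.get("fn_name") else "standard input"
    parts = ["QUESTION:", sample["question"]]
    if sample["starter_code"]:
        parts.append(sample["starter_code"])
    parts.append(f"Use {style} format.")
    parts.append("ANSWER:")
    return "\n".join(parts)


# Hypothetical row that specifies both fn_name and starter_code.
sample = {
    "question": "Return the sum of two integers.",
    "starter_code": "def add(a, b):",
    "input_output": json.dumps({"fn_name": "add", "inputs": [[1, 2]], "outputs": [3]}),
}
prompt = build_prompt(sample)
print(prompt)
```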

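Data Splits

The dataset contains a train split and a test split, each with 5,000 samples.

Dataset Statistics

10,000 coding problems
131,777 test cases
All problems have at least one test case, except for 195 samples in the train split
In the test split, the average number of test cases per problem is 21.2
The average length of a problem is 293.2 words
All problems have ground-truth solutions, except for 1,235 samples in the test split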
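Counts like these can be recomputed from the input_output field. The sketch below uses a hypothetical helper, summarize_test_cases, on made-up rows. It assumes that a row without tests stores an empty string and it averages over all rows, which may differ from how the figures above were computed.

```python
import json


def summarize_test_cases(rows):
    """Return (number of rows without tests, average tests per row)."""
    counts = []
    for row in rows:
        # Assumption: an empty string means the row has no test cases.
        io = json.loads(row["input_output"]) if row["input_output"] else {}
        counts.append(len(io.get("inputs", [])))
    missing = sum(1 for count in counts if count == 0)
    average = sum(counts) / len(counts) if counts else 0.0
    return missing, average


# Hypothetical rows with 2, 4, and 0 test cases.
rows = [
    {"input_output": json.dumps({"inputs": ["1\n", "2\n"], "outputs": ["1\n", "4\n"]})},
    {"input_output": json.dumps({"inputs": ["a", "b", "c", "d"], "outputs": ["A", "B", "C", "D"]})},
    {"input_output": ""},
]
missing, average = summarize_test_cases(rows)
print(missing, average)  # prints 1 2.0
```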

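Dataset Creation

To create the APPS dataset, the authors manually curated problems from open-access sites where programmers share problems with each other, including Codewars, AtCoder, Kattis, and Codeforces. For more details, please refer to the original paper.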

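Considerations for Using the Data

The AlphaCode authors found that evaluation on this dataset can produce many false positives: because of insufficient test coverage, incorrect submissions are marked as correct.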
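A toy illustration of how a false positive arises: an incorrect function passes a small test set and fails once one more case is added. The checker passes_all, the faulty function, and the test cases are hypothetical. This is neither the AlphaCode analysis nor the apps_metric implementation, and the list-of-arguments layout for call-based inputs is an assumption.

```python
def passes_all(fn, inputs, outputs):
    """True if fn(*args) equals the expected output for every test case."""
    return all(fn(*args) == expected for args, expected in zip(inputs, outputs))


def is_prime_wrong(n):
    # Incorrect on purpose: treats every odd number above 1 as prime.
    return n == 2 or (n > 1 and n % 2 == 1)


# Hypothetical test cases in an inputs/outputs layout.
weak_inputs, weak_outputs = [[2], [3], [4], [7]], [True, True, False, True]
extra_inputs, extra_outputs = [[9]], [False]

weak_result = passes_all(is_prime_wrong, weak_inputs, weak_outputs)
full_result = passes_all(
    is_prime_wrong, weak_inputs + extra_inputs, weak_outputs + extra_outputs
)
print(weak_result)  # prints True: the weak tests accept the wrong solution
print(full_result)  # prints False: the extra case n = 9 exposes the bug
```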

Citation Information

@article{hendrycksapps2021,
  title={Measuring Coding Challenge Competence With APPS},
  author={Dan Hendrycks and Steven Basart and Saurav Kadavath and Mantas Mazeika and Akul Arora and Ethan Guo and Collin Burns and Samir Puranik and Horace He and Dawn Song and Jacob Steinhardt},
  journal={NeurIPS},
  year={2021}
}

Author:

codeparrot

Dataset size:

1.3 GB