Dataset:
AhmedSSoliman/CoNaLa
This dataset has been processed for code generation. CMU CoNaLa, the Code/Natural Language Challenge, is a joint project of the Carnegie Mellon University NeuLab and STRUDEL Lab. It was designed to test systems that generate program snippets from natural language. The full corpus is available at https://conala-corpus.github.io/ ; this version contains about 13k records drawn from the roughly 600k examples in the full corpus.
Language: English
A sample from this dataset looks as follows:
[ { "intent": "convert a list to a dictionary in python", "snippet": "b = dict(zip(a[0::2], a[1::2]))" }, { "intent": "python - sort a list of nested lists", "snippet": "l.sort(key=sum_nested)" } ]
The dataset has the following fields (also called "features"):
{ "intent": "Value(dtype='string', id=None)", "snippet": "Value(dtype='string', id=None)" }
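As a rough illustration of the schema above, the following sketch parses the two sample records from this card and checks that each one carries exactly the two string fields `intent` and `snippet`. The helper name `validate_record` and the input list `a` are hypothetical, chosen only for this example; they are not part of the dataset or of any library.

```python
import json

# The two sample records shown on this card, as a JSON string.
SAMPLE_JSON = """
[
  {"intent": "convert a list to a dictionary in python",
   "snippet": "b = dict(zip(a[0::2], a[1::2]))"},
  {"intent": "python - sort a list of nested lists",
   "snippet": "l.sort(key=sum_nested)"}
]
"""

# Field names and Python types matching the card's feature description.
EXPECTED_FIELDS = {"intent": str, "snippet": str}


def validate_record(record):
    """Return True if the record has exactly the expected string fields.

    Hypothetical helper for this sketch, not part of any library.
    """
    if set(record) != set(EXPECTED_FIELDS):
        return False
    return all(
        isinstance(record[name], kind) for name, kind in EXPECTED_FIELDS.items()
    )


records = json.loads(SAMPLE_JSON)
all_valid = all(validate_record(r) for r in records)
print(all_valid)

# What the first sample snippet does, using a made-up input list:
# even positions become keys, odd positions become values.
a = ["x", 1, "y", 2]
b = dict(zip(a[0::2], a[1::2]))
print(b)
```

Note that a snippet is not always runnable on its own: the second sample refers to `l` and `sum_nested`, which the record does not define.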
This dataset is split into train, validation, and test sets. The split sizes are as follows:
Split name | Num samples |
---|---|
train | 11125 |
valid | 1237 |
test | 500 |
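To put the split sizes in proportion, the sketch below sums the three counts from the table and computes each split's share of the total. The variable names are illustrative only; the counts are copied from the table above.

```python
# Split sizes as listed in the table on this card.
SPLIT_SIZES = {"train": 11125, "valid": 1237, "test": 500}

# Total number of records across all splits.
total = sum(SPLIT_SIZES.values())

# Fraction of the total that each split represents.
shares = {name: size / total for name, size in SPLIT_SIZES.items()}

for name, size in SPLIT_SIZES.items():
    print(f"{name:<6}{size:>7}{shares[name]:>8.1%}")
print(f"{'total':<6}{total:>7}")
```

The total comes to 12,862 records, consistent with the "about 13k" figure in the description, with roughly 86.5% in train, 9.6% in valid, and 3.9% in test.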