CoNaLa Dataset for Code Generation

Table of content

Dataset Description
- Languages
Dataset Structure
- Data Instances
- Data Fields
- Data Splits

Dataset Descritpion

This dataset has been processed for Code Generation. CMU CoNaLa, the Code/Natural Language Challenge is a joint project of the Carnegie Mellon University NeuLab and STRUDEL Lab. This dataset was designed to test systems for generating program snippets from natural language. It is avilable at https://conala-corpus.github.io/ , and this is about 13k records from the full corpus of about 600k examples.

Languages

English

Dataset Structure

Data Instances

A sample from this dataset looks as follows:

[
  {
    "intent": "convert a list to a dictionary in python",
    "snippet": "b = dict(zip(a[0::2], a[1::2]))"
  },
  {
    "intent": "python - sort a list of nested lists",
    "snippet": "l.sort(key=sum_nested)"
  }
]

Dataset Fields

The dataset has the following fields (also called "features"):

{
  "intent": "Value(dtype='string', id=None)",
  "snippet": "Value(dtype='string', id=None)"
}

Dataset Splits

This dataset is split into a train, validation and test split. The split sizes are as follow:

Split name	Num samples
train	11125
valid	1237
test	500

作者:

AhmedSSoliman

数据集大小:

3.6 MB