Dataset: codeparrot/conala-mined-curated

DOI: 10.57967/hf/0755

Conala-mined-curated

Conala-mined-curated is a dataset based on the mined subset of the CoNaLa dataset. CoNaLa is a dataset crawled from Stack Overflow; part of it was filtered and curated to form a training set and a test set, but the mined part was not comparably post-processed. That mined part is a set of 600K examples that we decided to work on.

Dataset description

The CoNaLa datasets have 3 columns of interest. We give their descriptions as provided by the authors:

  • intent : Natural Language intent (i.e., the title of a Stack Overflow question)
  • snippet : A code snippet that implements the intent. This is the output of systems in the challenge.
  • rewritten_intent : Crowdsourced revised intents that try to better reflect the full meaning of the code, typically done by incorporating variable names and function arguments that appeared in the code into the intent. This is the input to be used by systems in the CoNaLa challenge.

For instruction fine-tuning, we would like to train a model that maps the rewritten_intent to the snippet . However, the mined subset does not have the rewritten_intent column, and intent is too vague to be used as an instruction, so we have to find a way to build the rewritten_intent column for the mined subset. That is exactly what was done in order to build this dataset.
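To make the instruction fine-tuning setup concrete, here is a minimal sketch of turning a (rewritten_intent, snippet) pair into a prompt/completion example. The prompt template and field names are assumptions for illustration; the card does not prescribe a particular format.

```python
# Sketch: build an instruction-tuning example from one dataset row.
# The "Instruction:/Answer:" template is an assumption, not the
# format used by the dataset authors.
def make_example(rewritten_intent: str, snippet: str) -> dict:
    prompt = f"Instruction: {rewritten_intent}\nAnswer:"
    return {"prompt": prompt, "completion": snippet}

example = make_example(
    "check if all elements in list `my_list` are identical",
    "all(x == my_list[0] for x in my_list)",
)
print(example["prompt"])
```

Any template works as long as it is applied consistently; the key point is that rewritten_intent is specific enough to serve as the instruction, while intent alone is not.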

Method

The most valuable information we have for recovering the rewritten_intent column is the intent and snippet columns. Fortunately, we also have the labeled training and test sets of CoNaLa, which give us a view of what a high-quality triplet ( intent , rewritten_intent , snippet ) looks like. We therefore had the idea to build a Seq2Seq model whose role would be to reconstruct the rewritten_intent based on the concatenation [ intent , snippet ].

More precisely, we fine-tuned Google's UL2 to solve this task.
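The Seq2Seq framing above can be sketched as follows. The separator tokens and field order used to serialize the concatenation are assumptions; the card does not specify the exact serialization fed to UL2.

```python
# Sketch of the Seq2Seq training pair: the source is the
# concatenated [intent, snippet], the target is rewritten_intent.
# The "intent: ... snippet: ..." serialization is an assumption.
def build_seq2seq_pair(intent: str, snippet: str, rewritten_intent: str):
    source = f"intent: {intent} snippet: {snippet}"
    target = rewritten_intent
    return source, target

src, tgt = build_seq2seq_pair(
    "How do I check if a list is empty?",
    "if not my_list: ...",
    "check if list `my_list` is empty",
)
```

Pairs like (src, tgt) would come from the labeled CoNaLa train/test splits; once fine-tuned, the model is run over the mined subset to fill in the missing rewritten_intent column.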

Usage

```python
from datasets import load_dataset

dataset = load_dataset("codeparrot/conala-mined-curated")
dataset
```

```
DatasetDict({
    train: Dataset({
        features: ['question_id', 'parent_answer_post_id', 'prob', 'snippet', 'intent', 'rewritten_intent', 'id'],
        num_rows: 593891
    })
})
```
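Before fine-tuning, you may want to keep only higher-confidence pairs. Treating the prob column as a mining-confidence score is an assumption based on its name; the card does not define it. Toy rows stand in for the real dataset below to keep the sketch self-contained.

```python
# Sketch: keep only rows whose `prob` exceeds a threshold.
# Interpreting `prob` as a confidence score is an assumption.
rows = [
    {"rewritten_intent": "sort list `l` in reverse", "snippet": "l.sort(reverse=True)", "prob": 0.91},
    {"rewritten_intent": "open file `f`", "snippet": "open(f)", "prob": 0.12},
]
high_confidence = [r for r in rows if r["prob"] > 0.5]
print(len(high_confidence))  # → 1
```

With the real dataset the same filter is `dataset["train"].filter(lambda ex: ex["prob"] > 0.5)`.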

Additional resources