Datasets:

neulab/conala

Tasks:

Text2Text Generation

Languages:

code

Multilinguality:

monolingual

Source datasets:

original

ArXiv:

arxiv:1805.08949

License:

mit

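Dataset Summary

CoNaLa is a benchmark dataset of aligned code and natural-language pairs for evaluating code generation. The data was crawled from Stack Overflow, filtered automatically, and then curated by annotators; the curated set is split into 2,379 training examples and 500 test examples. An automatically mined dataset with 593,891 examples (almost 600k) is also available.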

Supported Tasks and Leaderboards

This dataset is used to evaluate code generation tasks.
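As a rough illustration of how generated code could be scored against the reference snippets, the sketch below computes a whitespace-normalized exact-match rate. The choice of metric and the names `normalize` and `exact_match` are assumptions made for illustration; this card does not specify an evaluation metric, and the second reference/prediction pair is hypothetical.

```python
def normalize(code: str) -> str:
    """Collapse runs of whitespace so trivial spacing differences are ignored."""
    return " ".join(code.split())


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# First reference is a snippet from this card; the second pair is hypothetical.
references = ["sorted(l, key=lambda x: (-int(x[1]), x[0]))", "x = [1, 2]"]
predictions = ["sorted(l,  key=lambda x: (-int(x[1]), x[0]))", "x = [1, 3]"]
score = exact_match(predictions, references)
print(score)  # 0.5
```

Exact match is a crude measure: it penalizes functionally equivalent code that is written differently, and the whitespace normalization used here would also alter spacing inside string literals.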

Languages

English - Python code.

Dataset Structure

from datasets import load_dataset

dataset_curated = load_dataset("neulab/conala")
print(dataset_curated)
DatasetDict({
    train: Dataset({
        features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
        num_rows: 2379
    })
    test: Dataset({
        features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
        num_rows: 500
    })
})

dataset_mined = load_dataset("neulab/conala", "mined")
print(dataset_mined)
DatasetDict({
    train: Dataset({
        features: ['question_id', 'parent_answer_post_id', 'prob', 'snippet', 'intent', 'id'],
        num_rows: 593891
    })
})

Data Instances

CoNaLa - curated

This is the dataset curated by annotators.

{
    'question_id': 41067960,
    'intent': 'How to convert a list of multiple integers into a single integer?',
    'rewritten_intent': "Concatenate elements of a list 'x' of multiple integers to a single integer",
    'snippet': 'sum(d * 10 ** i for i, d in enumerate(x[::-1]))'
}
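
To make the intent/snippet pairing concrete, the snippet from this instance can be run on a small input. This is only a sketch: the wrapper name `concat_digits` and the sample lists are assumptions chosen for illustration.

```python
def concat_digits(x):
    # Snippet from the curated example: reversing the list gives the last
    # element weight 10**0, the one before it 10**1, and so on.
    return sum(d * 10 ** i for i, d in enumerate(x[::-1]))


result = concat_digits([1, 2, 3])
print(result)  # 123
```

Because each element is weighted by a power of ten, the result equals digit concatenation only when every element is a single digit; for example, `concat_digits([1, 23])` evaluates to 33 rather than 123.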

CoNaLa - mined

This is the automatically mined dataset, before curation.

{
    'question_id': 34705205,
    'parent_answer_post_id': 34705233,
    'prob': 0.8690001442846342,
    'snippet': 'sorted(l, key=lambda x: (-int(x[1]), x[0]))',
    'intent': 'Sort a nested list by two elements',
    'id': '34705205_34705233_0'
}
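
The mined snippet above can be tried on a small nested list to see what "sort by two elements" means here. The sample data is an assumption chosen for illustration.

```python
# Hypothetical nested list: each inner list holds a name and a numeric string.
l = [["a", "2"], ["b", "10"], ["c", "2"]]

# Snippet from the mined example: sort by the second element as an integer,
# descending, and break ties by the first element, ascending.
result = sorted(l, key=lambda x: (-int(x[1]), x[0]))
print(result)  # [['b', '10'], ['a', '2'], ['c', '2']]
```

The `int` conversion matters: compared as strings, `'10'` would sort before `'2'`.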

Data Fields

Curated:

| Field | Type | Description |
|---|---|---|
| question_id | int64 | Id of the Stack Overflow question |
| intent | string | Natural language intent (i.e., the title of a Stack Overflow question) |
| rewritten_intent | string | Crowdsourced revised intents that try to better reflect the full meaning of the code |
| snippet | string | Code snippet that implements the intent |
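
One plausible way to turn curated records into (input, target) pairs for a code generation model is sketched below. The helper name `to_pair` is hypothetical, and the fallback to `intent` assumes that `rewritten_intent` may be empty or `None` in some records, which this card does not state.

```python
def to_pair(example: dict) -> tuple[str, str]:
    """Return (natural-language input, target code) for one curated record."""
    # Prefer the crowdsourced rewrite; fall back to the question title
    # if the rewrite is missing (an assumption, see the text above).
    text = example.get("rewritten_intent") or example["intent"]
    return text, example["snippet"]


# Curated example taken from this card.
curated_example = {
    "question_id": 41067960,
    "intent": "How to convert a list of multiple integers into a single integer?",
    "rewritten_intent": "Concatenate elements of a list 'x' of multiple integers to a single integer",
    "snippet": "sum(d * 10 ** i for i, d in enumerate(x[::-1]))",
}
# Hypothetical variant of the same record with no rewritten intent.
fallback_example = dict(curated_example, rewritten_intent=None)

pair = to_pair(curated_example)
fallback_pair = to_pair(fallback_example)
print(pair[0])
print(fallback_pair[0])
```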

Mined:

| Field | Type | Description |
|---|---|---|
| question_id | int64 | Id of the Stack Overflow question |
| parent_answer_post_id | int64 | Id of the answer post from which the candidate snippet is extracted |
| intent | string | Natural language intent (i.e., the title of a Stack Overflow question) |
| snippet | string | Code snippet that implements the intent |
| id | string | Unique id for this intent/snippet pair |
| prob | float64 | Probability given by the mining model |
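
Since the mined pairs are not curated, a common-sense use of `prob` is to keep only pairs above some confidence cutoff. The sketch below works on plain dictionaries; the helper name `filter_by_prob`, the 0.5 threshold, and the second record are assumptions for illustration, not recommendations from the dataset authors.

```python
def filter_by_prob(records: list[dict], threshold: float = 0.5) -> list[dict]:
    """Keep mined records whose mining-model probability is at least `threshold`."""
    return [r for r in records if r["prob"] >= threshold]


records = [
    # Mined example taken from this card.
    {
        "question_id": 34705205,
        "parent_answer_post_id": 34705233,
        "prob": 0.8690001442846342,
        "snippet": "sorted(l, key=lambda x: (-int(x[1]), x[0]))",
        "intent": "Sort a nested list by two elements",
        "id": "34705205_34705233_0",
    },
    # Hypothetical low-probability record, invented for this sketch.
    {
        "question_id": 1,
        "parent_answer_post_id": 2,
        "prob": 0.12,
        "snippet": "pass",
        "intent": "Hypothetical low-confidence pair",
        "id": "1_2_0",
    },
]

kept = filter_by_prob(records)
print([r["id"] for r in kept])  # ['34705205_34705233_0']
```

A higher threshold trades coverage for cleaner pairs; the right value depends on the downstream task.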

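Data Splits

The dataset comes in two versions, curated and mined. The mined version has only a train split, while the curated version has two splits: train and test.

Dataset Creation

The dataset was crawled from Stack Overflow, filtered automatically, and then curated by annotators. For more details, see the original paper (arXiv:1805.08949).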

Citation Information

@inproceedings{yin2018learning,
  title={Learning to mine aligned code and natural language pairs from stack overflow},
  author={Yin, Pengcheng and Deng, Bowen and Chen, Edgar and Vasilescu, Bogdan and Neubig, Graham},
  booktitle={2018 IEEE/ACM 15th international conference on mining software repositories (MSR)},
  pages={476--486},
  year={2018},
  organization={IEEE}
}