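Dataset: neulab/conala
Task: text2text-generation
Language: code
Multilinguality: monolingual
Source datasets: original
ArXiv: arxiv:1805.08949
Other: code-generation
License: mit

CoNaLa is a benchmark of aligned code and natural-language pairs for evaluating code generation. The data was crawled from Stack Overflow, automatically filtered, and then curated by annotators. The curated set is split into 2,379 training examples and 500 test examples. An automatically mined dataset with nearly 600,000 examples is also provided.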
This dataset is used to evaluate code generation.

English - Python code.

    from datasets import load_dataset

    dataset_curated = load_dataset("neulab/conala")
    DatasetDict({
        train: Dataset({
            features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
            num_rows: 2379
        })
        test: Dataset({
            features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
            num_rows: 500
        })
    })

    dataset_mined = load_dataset("neulab/conala", "mined")
    DatasetDict({
        train: Dataset({
            features: ['question_id', 'parent_answer_post_id', 'prob', 'snippet', 'intent', 'id'],
            num_rows: 593891
        })
    })
CoNaLa - curated

This is the dataset curated by annotators.

    {
        'question_id': 41067960,
        'intent': 'How to convert a list of multiple integers into a single integer?',
        'rewritten_intent': "Concatenate elements of a list 'x' of multiple integers to a single integer",
        'snippet': 'sum(d * 10 ** i for i, d in enumerate(x[::-1]))'
    }

CoNaLa - mined

This is the automatically mined dataset, before curation.

    {
        'question_id': 34705205,
        'parent_answer_post_id': 34705233,
        'prob': 0.8690001442846342,
        'snippet': 'sorted(l, key=lambda x: (-int(x[1]), x[0]))',
        'intent': 'Sort a nested list by two elements',
        'id': '34705205_34705233_0'
    }
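Each `snippet` is plain Python, so one quick sanity check is to evaluate it against a small input. The sketch below does this for the two sample snippets from the records above. The helper `run_snippet` and the input values for `x` and `l` are illustrative assumptions, not part of the dataset, and `eval` should only be used on snippets you have inspected.

```python
# Snippets copied from the two sample records; the inputs below are made up.
curated_snippet = "sum(d * 10 ** i for i, d in enumerate(x[::-1]))"
mined_snippet = "sorted(l, key=lambda x: (-int(x[1]), x[0]))"


def run_snippet(snippet, **variables):
    """Evaluate a single-expression snippet with the given variables in scope.

    Hypothetical helper for illustration; it only handles expressions,
    not multi-statement snippets.
    """
    return eval(snippet, dict(variables))


# Digits 1, 2, 3 are concatenated into one integer.
print(run_snippet(curated_snippet, x=[1, 2, 3]))  # 123

# Rows are ordered by the numeric second element (descending), then the first.
print(run_snippet(mined_snippet, l=[["b", "2"], ["a", "2"], ["c", "10"]]))
# [['c', '10'], ['a', '2'], ['b', '2']]
```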
Curated data:
Field | Type | Description |
---|---|---|
question_id | int64 | Id of the Stack Overflow question |
intent | string | Natural Language intent (i.e., the title of a Stack Overflow question) |
rewritten_intent | string | Crowdsourced revised intents that try to better reflect the full meaning of the code |
snippet | string | Code snippet that implements the intent |
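For a text-to-code task, the curated fields can be turned into (input, target) pairs. The following is a minimal sketch, assuming that `rewritten_intent` may be missing or empty for some records and that falling back to `intent` is acceptable in that case. The helper name `to_pair` is hypothetical.

```python
def to_pair(record):
    """Return (source_text, target_code) for one curated record."""
    # Assumption: rewritten_intent can be None or empty for some records,
    # in which case the original question title is used instead.
    source = record.get("rewritten_intent") or record["intent"]
    return source, record["snippet"]


# The sample curated record shown earlier in this card.
example = {
    "question_id": 41067960,
    "intent": "How to convert a list of multiple integers into a single integer?",
    "rewritten_intent": "Concatenate elements of a list 'x' of multiple integers to a single integer",
    "snippet": "sum(d * 10 ** i for i, d in enumerate(x[::-1]))",
}

source, target = to_pair(example)
print(source)  # the rewritten intent
print(target)  # the code snippet
```

Preferring `rewritten_intent` follows its description in the table: it is meant to reflect the full meaning of the code better than the question title does.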
Mined data:
Field | Type | Description |
---|---|---|
question_id | int64 | Id of the Stack Overflow question |
parent_answer_post_id | int64 | Id of the answer post from which the candidate snippet is extracted |
intent | string | Natural Language intent (i.e., the title of a Stack Overflow question) |
snippet | string | Code snippet that implements the intent |
id | string | Unique id for this intent/snippet pair |
prob | float64 | Probability given by the mining model |
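Because the mined pairs are not curated, the `prob` field is a natural handle for keeping only higher-confidence pairs. The sketch below filters plain dictionaries by a threshold. The threshold value `0.5`, the helper name `filter_by_prob`, and the second record are assumptions made for illustration; only the first record comes from this card.

```python
def filter_by_prob(records, threshold=0.5):
    """Keep mined records whose mining-model probability is at least `threshold`."""
    return [r for r in records if r["prob"] >= threshold]


mined_sample = [
    # Sample mined record shown earlier in this card.
    {
        "question_id": 34705205,
        "parent_answer_post_id": 34705233,
        "prob": 0.8690001442846342,
        "snippet": "sorted(l, key=lambda x: (-int(x[1]), x[0]))",
        "intent": "Sort a nested list by two elements",
        "id": "34705205_34705233_0",
    },
    # Hypothetical low-probability record, invented for this example.
    {
        "question_id": 0,
        "parent_answer_post_id": 0,
        "prob": 0.12,
        "snippet": "import os",
        "intent": "Sort a nested list by two elements",
        "id": "hypothetical_0",
    },
]

kept = filter_by_prob(mined_sample, threshold=0.5)
print([r["id"] for r in kept])  # ['34705205_34705233_0']
```

With the `datasets` library, the same predicate can be passed to `Dataset.filter`, for example `dataset_mined["train"].filter(lambda r: r["prob"] >= 0.5)`. A suitable threshold depends on how much noise the downstream task tolerates.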
The dataset comes in two versions, curated and mined. The mined version has only a training split, while the curated version has two splits: train and test.
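Neither version ships a validation split, so one may need to be carved out of the curated training set. This is a rough sketch of a reproducible hold-out over row indices. The 10% fraction, the seed, and the helper name `holdout_indices` are assumptions, not something the dataset prescribes.

```python
import random


def holdout_indices(num_rows, fraction=0.1, seed=0):
    """Split row indices into (train, validation) with a seeded shuffle."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    cut = int(num_rows * fraction)  # number of rows held out for validation
    return sorted(indices[cut:]), sorted(indices[:cut])


# 2379 is the size of the curated training split.
train_idx, valid_idx = holdout_indices(2379)
print(len(train_idx), len(valid_idx))  # 2142 237
```

The resulting index lists can be passed to `Dataset.select` in the `datasets` library, which also offers `Dataset.train_test_split` for the same purpose.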
The dataset was crawled from Stack Overflow, automatically filtered, and then curated by annotators. For more details, see the original paper.

    @inproceedings{yin2018learning,
      title={Learning to mine aligned code and natural language pairs from stack overflow},
      author={Yin, Pengcheng and Deng, Bowen and Chen, Edgar and Vasilescu, Bogdan and Neubig, Graham},
      booktitle={2018 IEEE/ACM 15th international conference on mining software repositories (MSR)},
      pages={476--486},
      year={2018},
      organization={IEEE}
    }