Dataset:

koutch/staqc

Tasks:

Question Answering

Languages:

en

Size:

10K<n<100K

ArXiv:

arxiv:1803.09371

Other:

code

License:

cc-by-4.0

Dataset Card for StaQC (A Systematically Mined Question-Code Dataset from Stack Overflow)

Dataset Summary

StaQC (Stack Overflow Question-Code pairs) is a large dataset of around 148K Python and 120K SQL domain question-code pairs, which are automatically mined from Stack Overflow using a Bi-View Hierarchical Neural Network. StaQC is collected from three sources: multi-code answer posts, single-code answer posts, and manual annotations on multi-code answer posts.

The dataset was originally released by the main authors on GitHub. This version is an unmodified copy, redistributed as permitted by the license and made available on the hub for easier access.

Standalone solutions

As noted in the paper, the authors define a code snippet as a code solution when the questioner can solve the problem solely based on it (also called a "standalone" solution).

Manual annotations

The manual annotations are the collection of multi-code answer posts for which each code snippet was annotated with a boolean indicating whether or not the snippet is a standalone solution to the question.

Multi-code answer posts

A multi-code answer post is an (accepted) answer post that contains multiple code snippets, some of which may not be standalone code solutions to the question (see Section 1 of the paper). For example, in this multi-code answer post, the third code snippet is not a code solution to the question "How to limit a number to be within a specified range? (Python)".

Note: the multi-code answer posts also contain the manual annotations.

Single-code answer posts

A Single-code answer post is an (accepted) answer post that contains only one code snippet. We pair such code snippets with the question title as a question-code pair.

Supported Tasks and Leaderboards

This dataset can be used for Natural Language to Code Generation tasks.
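As an illustration, a question-code pair can be framed as a source/target example for a sequence-to-sequence model. The sketch below assumes a simple dictionary format with `source` and `target` keys; both the helper `to_seq2seq_example` and this format are hypothetical and are not prescribed by the dataset or the paper. The pair is a shortened version of a single-code example from this card.

```python
def to_seq2seq_example(question, code):
    """Map a natural-language question to the code it should generate."""
    # Hypothetical format: strip surrounding whitespace from the question
    # and trailing newlines from the code.
    return {"source": question.strip(), "target": code.rstrip("\n")}


example = to_seq2seq_example(
    "Python: get OS language",
    "import locale\nloc = locale.getlocale()\n",
)
print(example)
```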

Languages

Python, SQL, English

Dataset Structure

Data Instances

Each configuration corresponds to one of the three parts, in a given programming language.

There are three parts for the dataset:

  • mca (Multi-code answer posts)
  • sca (Single-code answer posts)
  • man (Manual annotations)

And two programming/query languages:

  • python
  • sql

A configuration is obtained by combining a part with a programming language. For instance, the automatically mined multi-code answers in Python can be loaded using:

from datasets import load_dataset

dataset = load_dataset("koutch/staqc", "mca_python")
print(dataset)
DatasetDict({
    train: Dataset({
        features: ['id', 'question_id', 'question', 'snippet'],
        num_rows: 40391
    })
})

or the manual annotations using:

dataset = load_dataset("koutch/staqc", "man_sql")
print(dataset)
DatasetDict({
    train: Dataset({
        features: ['id', 'question_id', 'question', 'snippet'],
        num_rows: 1587
    })
})
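
The configuration names follow a `<part>_<language>` pattern, as in the two calls above. A minimal sketch that enumerates all six combinations; only `mca_python` and `man_sql` appear in this card, so the other four names are an assumption based on that pattern:

```python
from itertools import product

PARTS = ["mca", "sca", "man"]   # multi-code, single-code, manual annotations
LANGUAGES = ["python", "sql"]

# Assumed naming scheme: "<part>_<language>", e.g. "mca_python".
config_names = [f"{part}_{lang}" for part, lang in product(PARTS, LANGUAGES)]

print(config_names)
```

Each name in `config_names` could then be passed as the second argument to `load_dataset`, provided the assumed names match the ones published on the hub.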

Manual annotations

For a given Stack Overflow question, the manual annotations record, for each individual code block in the accepted answer, whether or not that code block is a standalone solution to the question asked (the question title).

{
  'question_id': 5947137,
  'question': 'How can I use a list comprehension to extend a list in python?',
  'snippet': {'text': ['import itertools as it\n\nreturn sum(it.imap(doSomething, originalList), [])\n',
   'return sum(map(doSomething, originalList), [])\n',
   'return sum((doSomething(x) for x in originalList), [])\n',
   'accumulationList = []\nfor x in originalList:\n    accumulationList.extend(doSomething(x))\nreturn accumulationList\n'],
  'is_sda': [True, True, True, True]}
}
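
Since `snippet['text']` and `snippet['is_sda']` are parallel lists in the instance above, the standalone solutions can be selected by zipping them. A small sketch on a hand-made record of the same shape; the helper name `standalone_snippets` and the record's values are illustrative and are not taken from the dataset:

```python
def standalone_snippets(record):
    """Return the code snippets flagged as standalone solutions."""
    snippet = record["snippet"]
    return [
        text
        for text, is_sda in zip(snippet["text"], snippet["is_sda"])
        if is_sda
    ]


# Illustrative record; the values are made up for this example.
record = {
    "question_id": 0,
    "question": "How do I reverse a list?",
    "snippet": {
        "text": ["xs = [1, 2, 3]\n", "xs[::-1]\n"],
        "is_sda": [False, True],
    },
}

print(standalone_snippets(record))
```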

Multi-code answer posts

{
  'question_id': 35349290,
  'question': 'Python: Generating YYMM string between two dates',
  'snippet': ['start_year = 2005\nend_year = 2007\nstart_month = 3\nend_month = 2\nyymm = [(yy, mm) for yy in range(start_year, end_year + 1) for mm in range(1, 13)\n        if (start_year, start_month) <= (yy, mm) <= (end_year, end_month)]\n',
  "formatted_yymm = ['{:>02}{:>02}.mat'.format(yy % 100, mm) for yy, mm in yymm]\n"]
}

Single-code answer posts

{
 'question_id': 19387200,
 'question': 'Python: get OS language',
 'snippet': "import locale\nloc = locale.getlocale() # get current locale\nlocale.getdefaultlocale() # Tries to determine the default locale settings and returns them as a tuple of the form (language code, encoding); e.g, ('en_US', 'UTF-8').\n"
}

Data Fields

  • question_id : ID of the Stack Overflow question
  • question : title of the Stack Overflow question, repurposed as the natural language intent
  • snippet : mined or annotated standalone solution(s) (potentially) answering the question
  • is_sda : for the manual annotations, whether or not the given code snippet is a standalone solution to the question.
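
In the instances shown in this card, the `snippet` field has a different shape in each part: a string in `sca`, a list of strings in `mca`, and a dictionary with `text` and `is_sda` in `man`. A sketch of a helper that turns any of these records into (question, code) pairs; `iter_pairs` is a hypothetical name, the sample records are made up, and keeping only the snippets with `is_sda` set to True for `man` is one possible choice rather than a rule of the dataset:

```python
def iter_pairs(record):
    """Yield (question, code) pairs from an sca, mca, or man record."""
    question = record["question"]
    snippet = record["snippet"]
    if isinstance(snippet, str):       # sca: a single snippet
        yield question, snippet
    elif isinstance(snippet, dict):    # man: keep standalone snippets only
        for text, is_sda in zip(snippet["text"], snippet["is_sda"]):
            if is_sda:
                yield question, text
    else:                              # mca: a list of snippets
        for text in snippet:
            yield question, text


# Made-up records mirroring the three shapes.
sca = {"question": "q1", "snippet": "a = 1\n"}
mca = {"question": "q2", "snippet": ["b = 2\n", "c = 3\n"]}
man = {
    "question": "q3",
    "snippet": {"text": ["d = 4\n", "e = 5\n"], "is_sda": [True, False]},
}

for rec in (sca, mca, man):
    print(list(iter_pairs(rec)))
```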

Data Splits

Each configuration of the dataset contains only a training split.

Dataset Creation

Source Data

Stack Overflow data dump.

Annotations

See Section 2.3, "Annotating QC Pairs for Model Training", of the paper.

Additional Information

Licensing Information

This work is licensed under a Creative Commons Attribution 4.0 International License.

Citation Information

If you use the dataset or the code in your research, please cite the following paper:

@inproceedings{yao2018staqc,
  title={StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow},
  author={Yao, Ziyu and Weld, Daniel S and Chen, Wei-Peng and Sun, Huan},
  booktitle={Proceedings of the 2018 World Wide Web Conference on World Wide Web},
  pages={1693--1703},
  year={2018},
  organization={International World Wide Web Conferences Steering Committee}
}

Contributions information

I did not contribute to the creation of this dataset, only to its redistribution. All credit should go to the original authors.