Dataset:

koutch/intro_prog

Dataset Card for intro_prog

Dataset Description

Dataset Summary

IntroProg is a collection of students' submissions to assignments in various introductory programming courses offered at different universities. Currently, the dataset contains submissions collected from Dublin City University and the National University of Singapore.

Dublin

The Dublin programming dataset is composed of students' submissions to introductory programming assignments at Dublin City University. Students submitted these programs for multiple programming courses over three academic years.

Singapore

The Singapore dataset contains 2442 correct and 1783 buggy program attempts by 361 undergraduate students taking an introductory Python programming course for credit at the National University of Singapore (NUS).

Supported Tasks and Leaderboards

"metadata": Program synthesis

Similarly to Mostly Basic Python Problems (MBPP), the "metadata" configuration can be used to evaluate code generation models.

"data"

The "data" configuration contains all the submissions, along with an indicator of whether each one passed the required tests.
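As a small illustration of what the pass/fail indicator allows, the sketch below computes the share of correct submissions per function. It assumes each record is a plain dictionary with the `func_name` and `correct` fields listed under Data Fields; the helper name `pass_rate_by_function` and the sample records are made up for this example.

```python
from collections import defaultdict


def pass_rate_by_function(records):
    """Return {func_name: fraction of submissions marked correct}."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for record in records:
        name = record["func_name"]
        totals[name] += 1
        passed[name] += bool(record["correct"])
    return {name: passed[name] / totals[name] for name in totals}


# Made-up records for illustration; not taken from the dataset.
sample = [
    {"func_name": "add", "correct": True},
    {"func_name": "add", "correct": False},
    {"func_name": "mul", "correct": True},
]
rates = pass_rate_by_function(sample)
```

The same function should work on any iterable of records, for example the rows of a loaded split.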

"repair": Program refinement/repair

The "repair" configuration of each dataset is a subset of the "data" configuration augmented with educators' annotations on the corrections to the buggy programs. This configuration can be used for the task of program refinement. In Computing Education Research (CER), methods for automatically repairing student programs are used to provide students with feedback and help them debug their code.
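One simple way to inspect a correction is to diff a buggy program against its repaired version. The sketch below uses Python's standard `difflib`; it does not assume anything about how the annotations are stored in this configuration, and the two program strings are made up for illustration.

```python
import difflib


def repair_diff(buggy: str, repaired: str) -> str:
    """Return a unified diff between a buggy program and its repaired version."""
    lines = difflib.unified_diff(
        buggy.splitlines(keepends=True),
        repaired.splitlines(keepends=True),
        fromfile="buggy",
        tofile="repaired",
    )
    return "".join(lines)


# Made-up programs for illustration; not taken from the dataset.
buggy_program = "def add(a, b):\n    return a - b\n"
repaired_program = "def add(a, b):\n    return a + b\n"
diff_text = repair_diff(buggy_program, repaired_program)
```

A line-level diff like this is only a coarse view of a repair; feedback tools often work at the token or syntax-tree level instead.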

"bug": Bug classification

[Coming soon]

Languages

The assignments were written in Python.

Dataset Structure

Each configuration is defined by one source dataset ("dublin" or "singapore") and one subconfiguration ("metadata", "data", or "repair"):

  • "dublin_metadata"
  • "dublin_data"
  • "dublin_repair"
  • "singapore_metadata"
  • "singapore_data"
  • "singapore_repair"
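The naming scheme above can be sketched in a few lines. The helpers `config_name` and `load_config` are hypothetical; the `load_dataset` call assumes the Hugging Face `datasets` library is installed, that network access is available, and that the repository id is `koutch/intro_prog` as shown at the top of this card.

```python
from itertools import product

SOURCES = ("dublin", "singapore")
SUBCONFIGS = ("metadata", "data", "repair")


def config_name(source: str, subconfig: str) -> str:
    """Build a configuration name such as "dublin_repair" (hypothetical helper)."""
    if source not in SOURCES:
        raise ValueError(f"unknown source dataset: {source!r}")
    if subconfig not in SUBCONFIGS:
        raise ValueError(f"unknown subconfiguration: {subconfig!r}")
    return f"{source}_{subconfig}"


ALL_CONFIGS = [config_name(s, c) for s, c in product(SOURCES, SUBCONFIGS)]


def load_config(source: str, subconfig: str):
    """Load one configuration; needs network access and the `datasets` package."""
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    return load_dataset("koutch/intro_prog", name=config_name(source, subconfig))
```

For example, `load_config("singapore", "repair")` would request the "singapore_repair" configuration.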

Data Instances

[More Information Needed]

Data Fields

[More Information Needed]

Some of the fields are configuration-specific:

  • submission_id: a unique number identifying the submission
  • user: a unique string identifying the (anonymized) student who submitted the solution
  • date: the timestamp at which the grading server received the submission
  • func_code: the cleaned code submitted
  • func_name: the name of the function that had to be implemented
  • assignment_id: the unique (string) identifier of the assignment that had to be completed
  • academic_year: the starting year of the academic year (e.g. 2015 for the academic year 2015-2016)
  • module: the course/module
  • test: a HumanEval-style string which can be used to execute the submitted solution on the provided test cases
  • description: a description of what the function is supposed to achieve
  • correct: whether the solution passed all tests or not
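The fields above can be combined to re-run a submission against its tests. The sketch below is an assumption-laden illustration, not the graders' actual procedure: it assumes the `test` string defines a function `check(candidate)` as in HumanEval, which this card does not spell out, and the example record is made up.

```python
def passes_tests(record: dict) -> bool:
    """Return True if the submitted function passes its test string.

    Assumes `test` defines `check(candidate)` in the HumanEval style;
    that format is an assumption, not something this card specifies.
    """
    namespace: dict = {}
    try:
        exec(record["func_code"], namespace)  # defines the student's function
        exec(record["test"], namespace)       # defines check(candidate)
        namespace["check"](namespace[record["func_name"]])
    except Exception:
        # Any error (syntax error, failed assertion, missing name) counts as a failure.
        return False
    return True


# Made-up record for illustration; not taken from the dataset.
example = {
    "func_name": "add",
    "func_code": "def add(a, b):\n    return a + b\n",
    "test": "def check(candidate):\n    assert candidate(1, 2) == 3\n    assert candidate(0, 0) == 0\n",
}
buggy = dict(example, func_code="def add(a, b):\n    return a - b\n")
```

Student code is untrusted and may loop forever or touch the file system, so in practice this kind of execution belongs in a sandboxed process with a timeout.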

Data Splits

Dublin

The Dublin dataset is split into a training set and a test set. The training set contains the submissions to the assignments written during the academic years 2015-2016 and 2016-2017, while the test set contains programs written during the academic year 2017-2018.
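The year-based split described above can be reproduced from the `academic_year` field. This is a minimal sketch assuming records are dictionaries and that `academic_year` holds the starting year as a number or numeric string; the helper name `split_by_year` and the sample records are hypothetical.

```python
TRAIN_YEARS = {2015, 2016}  # academic years 2015-2016 and 2016-2017
HELD_OUT_YEARS = {2017}     # academic year 2017-2018


def split_by_year(records):
    """Partition records into (train, held_out) using `academic_year`."""
    train, held_out = [], []
    for record in records:
        year = int(record["academic_year"])
        if year in TRAIN_YEARS:
            train.append(record)
        elif year in HELD_OUT_YEARS:
            held_out.append(record)
    return train, held_out


# Made-up records for illustration; not taken from the dataset.
sample = [
    {"submission_id": 1, "academic_year": 2015},
    {"submission_id": 2, "academic_year": 2016},
    {"submission_id": 3, "academic_year": 2017},
]
train, held_out = split_by_year(sample)
```

Splitting by year rather than at random keeps all submissions from a given cohort on the same side of the split.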

Singapore

The Singapore dataset contains only a training split. It can serve as a test split for evaluating how your feedback methods perform on an unseen dataset (for instance, if you train your methods on the Dublin dataset).

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

Dublin Singapore

The data was released under the GNU Lesser General Public License v3.0.

Citation Information

@inproceedings{azcona2019user2code2vec,
  title={user2code2vec: Embeddings for Profiling Students Based on Distributional Representations of Source Code},
  author={Azcona, David and Arora, Piyush and Hsiao, I-Han and Smeaton, Alan},
  booktitle={Proceedings of the 9th International Learning Analytics & Knowledge Conference (LAK’19)},
  year={2019},
  organization={ACM}
}
@inproceedings{DBLP:conf/edm/CleuziouF21,
  author    = {Guillaume Cleuziou and
               Fr{\'{e}}d{\'{e}}ric Flouvat},
  editor    = {Sharon I{-}Han Hsiao and
               Shaghayegh (Sherry) Sahebi and
               Fran{\c{c}}ois Bouchet and
               Jill{-}J{\^{e}}nn Vie},
  title     = {Learning student program embeddings using abstract execution traces},
  booktitle = {Proceedings of the 14th International Conference on Educational Data
               Mining, {EDM} 2021, virtual, June 29 - July 2, 2021},
  publisher = {International Educational Data Mining Society},
  year      = {2021},
  timestamp = {Wed, 09 Mar 2022 16:47:22 +0100},
  biburl    = {https://dblp.org/rec/conf/edm/CleuziouF21.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Contributions

[More Information Needed]