Dataset:

code_x_glue_cc_clone_detection_big_clone_bench

Languages:

code

Multilinguality:

monolingual

Size categories:

1M<n<10M

Language creators:

found

Annotations creators:

found

Source datasets:

original

License:

c-uda

Dataset Card for "code_x_glue_cc_clone_detection_big_clone_bench"

Dataset Summary

CodeXGLUE Clone-detection-BigCloneBench dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench

Given two code snippets as input, the task is binary classification (0/1), where 1 stands for semantic equivalence and 0 for anything else. Models are evaluated by F1 score. The dataset is BigCloneBench, filtered following the paper Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree.
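As an illustration of the metric, the following minimal sketch computes F1 for the positive class (1 = semantically equivalent pair) in plain Python. The function name `f1_score_binary` and the toy labels are hypothetical, and the official CodeXGLUE evaluator may average the score differently, so treat this as a sketch rather than the reference implementation.

```python
from typing import Sequence


def f1_score_binary(labels: Sequence[int], preds: Sequence[int]) -> float:
    """F1 for the positive class, where 1 marks a clone pair (assumed convention)."""
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    # Count true positives, false positives and false negatives.
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        # Precision or recall is zero (or undefined), so F1 is reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example (made-up values): 2 true positives, 1 false positive, 1 false negative.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_preds = [1, 0, 0, 1, 1, 0]
print(round(f1_score_binary(toy_labels, toy_preds), 3))  # 0.667
```

Because the score is computed on the positive class only, a model that predicts 0 for every pair gets an F1 of 0 regardless of how many negatives it classifies correctly.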

Supported Tasks and Leaderboards

  • semantic-similarity-classification : The dataset can be used to train a model that classifies whether two given Java methods are clones of each other.

Languages

  • Java programming language

Dataset Structure

Data Instances

An example of 'test' looks as follows.

{
    "func1": "    @Test(expected = GadgetException.class)\n    public void malformedGadgetSpecIsCachedAndThrows() throws Exception {\n        HttpRequest request = createCacheableRequest();\n        expect(pipeline.execute(request)).andReturn(new HttpResponse(\"malformed junk\")).once();\n        replay(pipeline);\n        try {\n            specFactory.getGadgetSpec(createContext(SPEC_URL, false));\n            fail(\"No exception thrown on bad parse\");\n        } catch (GadgetException e) {\n        }\n        specFactory.getGadgetSpec(createContext(SPEC_URL, false));\n    }\n", 
    "func2": "    public InputStream getInputStream() throws TGBrowserException {\n        try {\n            if (!this.isFolder()) {\n                URL url = new URL(this.url);\n                InputStream stream = url.openStream();\n                return stream;\n            }\n        } catch (Throwable throwable) {\n            throw new TGBrowserException(throwable);\n        }\n        return null;\n    }\n", 
    "id": 0, 
    "id1": 2381663, 
    "id2": 4458076, 
    "label": false
}

Data Fields

Each data field is explained below for each config. The data fields are the same among all splits.

default
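| field name | type   | description                                                               |
|------------|--------|---------------------------------------------------------------------------|
| id         | int32  | Index of the sample                                                       |
| id1        | int32  | The first function id                                                     |
| id2        | int32  | The second function id                                                    |
| func1      | string | The full text of the first function                                       |
| func2      | string | The full text of the second function                                      |
| label      | bool   | 1 if the two functions are semantically equivalent (clones), 0 otherwise  |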
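The field list can be mirrored as a small typed record. The sketch below is only an illustration: the `ClonePair` class and `parse_clone_pair` helper are hypothetical names, not part of the dataset tooling, and the function bodies in the sample record are shortened placeholders rather than real dataset text.

```python
from dataclasses import dataclass
from typing import Any, Mapping

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


@dataclass(frozen=True)
class ClonePair:
    """One example with the six fields listed above (hypothetical container)."""
    id: int
    id1: int
    id2: int
    func1: str
    func2: str
    label: bool


def parse_clone_pair(record: Mapping[str, Any]) -> ClonePair:
    """Check field types against the documented schema and build a ClonePair."""
    for key in ("id", "id1", "id2"):
        value = record[key]
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{key} must be an integer")
        if not INT32_MIN <= value <= INT32_MAX:
            raise ValueError(f"{key} does not fit in int32")
    for key in ("func1", "func2"):
        if not isinstance(record[key], str):
            raise TypeError(f"{key} must be a string")
    if not isinstance(record["label"], bool):
        raise TypeError("label must be a bool")
    return ClonePair(**{k: record[k] for k in ("id", "id1", "id2", "func1", "func2", "label")})


# Sample record using the ids from the 'test' instance; function text is a placeholder.
sample = {
    "id": 0,
    "id1": 2381663,
    "id2": 4458076,
    "func1": "public void a() {}\n",
    "func2": "public void b() {}\n",
    "label": False,
}
pair = parse_clone_pair(sample)
print(pair.id1, pair.id2, pair.label)
```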

Data Splits

| name    | train  | validation | test   |
|---------|--------|------------|--------|
| default | 901028 | 415416     | 415416 |
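For a quick sense of how the examples are distributed, the split sizes above can be turned into proportions. This is a small arithmetic sketch using only the numbers in the table; the variable names are arbitrary.

```python
# Split sizes copied from the table above.
split_sizes = {"train": 901028, "validation": 415416, "test": 415416}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

print(f"total examples: {total}")  # total examples: 1731860
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```

The total of 1,731,860 pairs is consistent with the 1M<n<10M size category, and roughly half of the pairs fall in the training split.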

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

Data was mined from the IJaDataset 2.0 dataset. [More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

Potential clones were identified automatically using search heuristics and then manually labeled by three judges. [More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

Most of the clones are type 1 and 2 with type 3 and especially type 4 being rare.

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

https://github.com/microsoft , https://github.com/madlag

Licensing Information

Computational Use of Data Agreement (C-UDA) License.

Citation Information

@inproceedings{svajlenko2014towards,
  title={Towards a big data curated benchmark of inter-project code clones},
  author={Svajlenko, Jeffrey and Islam, Judith F and Keivanloo, Iman and Roy, Chanchal K and Mia, Mohammad Mamun},
  booktitle={2014 IEEE International Conference on Software Maintenance and Evolution},
  pages={476--480},
  year={2014},
  organization={IEEE}
}

@inproceedings{wang2020detecting,
  title={Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree},
  author={Wang, Wenhan and Li, Ge and Ma, Bo and Xia, Xin and Jin, Zhi},
  booktitle={2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
  pages={261--271},
  year={2020},
  organization={IEEE}
}

Contributions

Thanks to @madlag (and partly also @ncoop57) for adding this dataset.