CodeXGLUE Clone-detection-BigCloneBench dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench
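Given two code snippets as input, the task is binary classification (0/1), where 1 means the snippets are semantically equivalent and 0 means they are not. Models are evaluated by F1 score. The dataset used is BigCloneBench, filtered following the paper "Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree".
An example of "test" looks as follows.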
{ "func1": " @Test(expected = GadgetException.class)\n public void malformedGadgetSpecIsCachedAndThrows() throws Exception {\n HttpRequest request = createCacheableRequest();\n expect(pipeline.execute(request)).andReturn(new HttpResponse(\"malformed junk\")).once();\n replay(pipeline);\n try {\n specFactory.getGadgetSpec(createContext(SPEC_URL, false));\n fail(\"No exception thrown on bad parse\");\n } catch (GadgetException e) {\n }\n specFactory.getGadgetSpec(createContext(SPEC_URL, false));\n }\n", "func2": " public InputStream getInputStream() throws TGBrowserException {\n try {\n if (!this.isFolder()) {\n URL url = new URL(this.url);\n InputStream stream = url.openStream();\n return stream;\n }\n } catch (Throwable throwable) {\n throw new TGBrowserException(throwable);\n }\n return null;\n }\n", "id": 0, "id1": 2381663, "id2": 4458076, "label": false }
The data fields for each config are explained below. The data fields are the same among all splits.
default

field name | type | description |
---|---|---|
id | int32 | Index of the sample |
id1 | int32 | The first function id |
id2 | int32 | The second function id |
func1 | string | The full text of the first function |
func2 | string | The full text of the second function |
label | bool | True (1) if the functions are semantically equivalent, False (0) otherwise |
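The field table above can be mirrored as a small typed record. The following is a minimal sketch, not part of the dataset tooling: the `ClonePair` class and its `from_record` helper are hypothetical names introduced here, and the function bodies in the sample record are shortened placeholders rather than real dataset content.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class ClonePair:
    """One sample, with the six fields listed in the table (hypothetical helper)."""

    id: int      # index of the sample
    id1: int     # id of the first function
    id2: int     # id of the second function
    func1: str   # full text of the first function
    func2: str   # full text of the second function
    label: bool  # boolean clone label

    @classmethod
    def from_record(cls, record: Mapping[str, Any]) -> "ClonePair":
        # Coerce each field to the type given in the table; a missing key
        # raises KeyError, which surfaces malformed records early.
        return cls(
            id=int(record["id"]),
            id1=int(record["id1"]),
            id2=int(record["id2"]),
            func1=str(record["func1"]),
            func2=str(record["func2"]),
            label=bool(record["label"]),
        )


# Placeholder record: the ids and label match the "test" example,
# the function texts are shortened for illustration.
record = {
    "id": 0,
    "id1": 2381663,
    "id2": 4458076,
    "func1": "public void a() {}",
    "func2": "public void b() {}",
    "label": False,
}
pair = ClonePair.from_record(record)
print(pair.id1, pair.id2, pair.label)
```

Freezing the dataclass keeps samples immutable, which helps when the same pair is shared between evaluation and logging code.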
name | train | validation | test |
---|---|---|---|
default | 901028 | 415416 | 415416 |
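Because models on this task are scored by F1 over the binary label, a small worked example may help. This is a sketch of the standard F1 for the positive class (label 1); whether the official CodeXGLUE evaluator uses this or an averaged variant is not stated in this card, so treat the choice as an assumption. The function name `f1_score` and the sample labels are illustrative only.

```python
from typing import Sequence


def f1_score(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Binary F1 for the positive class (1), assuming labels and preds are 0/1."""
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")

    # Count true positives, false positives and false negatives.
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)

    # With no true positives, precision or recall is zero (or undefined);
    # returning 0.0 is a common convention.
    if tp == 0:
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up predictions for five pairs: 2 true positives, 1 false positive,
# 1 false negative, so precision = recall = 2/3.
labels = [1, 0, 1, 1, 0]
preds = [1, 0, 0, 1, 1]
score = f1_score(labels, preds)
print(f"F1 = {score:.4f}")
```

F1 is preferred over plain accuracy here because clone and non-clone pairs need not be balanced, and accuracy can look high for a model that rarely predicts the positive class.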
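[More Information Needed]
The data was mined from the IJaDataset 2.0 dataset. [More Information Needed]
Who are the source language producers? [More Information Needed]
Potential clones were identified automatically using search heuristics and then manually labeled by three judges. [More Information Needed]
Who are the annotators? [More Information Needed]
[More Information Needed]
[More Information Needed]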
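Most of the clones are of type 1 and type 2; type 3 clones, and especially type 4 clones, are rare.
[More Information Needed]
[More Information Needed]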
https://github.com/microsoft , https://github.com/madlag
Computational Use of Data Agreement (C-UDA) License.
@inproceedings{svajlenko2014towards,
  title={Towards a big data curated benchmark of inter-project code clones},
  author={Svajlenko, Jeffrey and Islam, Judith F and Keivanloo, Iman and Roy, Chanchal K and Mia, Mohammad Mamun},
  booktitle={2014 IEEE International Conference on Software Maintenance and Evolution},
  pages={476--480},
  year={2014},
  organization={IEEE}
}

@inproceedings{wang2020detecting,
  title={Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree},
  author={Wang, Wenhan and Li, Ge and Ma, Bo and Xia, Xin and Jin, Zhi},
  booktitle={2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
  pages={261--271},
  year={2020},
  organization={IEEE}
}
Thanks to @madlag (and partly also @ncoop57) for adding this dataset.