Dataset:

code_x_glue_cc_clone_detection_big_clone_bench

Task:

Text classification

Subtask:

semantic-similarity-classification

Language:

code

Multilinguality:

monolingual

Size:

1M<n<10M

Language creators:

found

Annotation creators:

found

Source datasets:

original

License:

c-uda

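Dataset Card for "code_x_glue_cc_clone_detection_big_clone_bench"

Dataset Summary

CodeXGLUE Clone-detection-BigCloneBench dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench

Given two code snippets as input, the task is binary classification (0/1), where 1 stands for semantic equivalence and 0 for anything else. Models are evaluated by F1 score. The dataset is BigCloneBench, filtered following the paper "Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree".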
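As a rough illustration of the evaluation metric, the sketch below computes F1 for the positive class (label 1, a clone pair) from gold and predicted labels. The function name `f1_score` and the toy labels are assumptions made for this example; the card does not specify how the official evaluator averages the score, so treat this as one plausible reading rather than the reference implementation.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1 = semantically equivalent pair)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # clones found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed clones
    if tp == 0:
        # Precision and recall are both zero (or undefined); report 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration only (not taken from the dataset).
gold = [1, 0, 1, 1, 0]
pred = [1, 0, 0, 1, 1]
score = f1_score(gold, pred)
print(f"F1 = {score:.3f}")
```

Because the metric depends only on the positive class, a model that predicts 0 for every pair scores an F1 of 0 regardless of how many non-clone pairs it gets right.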

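Supported Tasks and Leaderboards

semantic-similarity-classification: the dataset can be used to train a model that classifies whether two given Java methods are clones of each other.

Languages

Java programming language

Dataset Structure

Data Instances

An example of "test" looks as follows.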

{
    "func1": "    @Test(expected = GadgetException.class)\n    public void malformedGadgetSpecIsCachedAndThrows() throws Exception {\n        HttpRequest request = createCacheableRequest();\n        expect(pipeline.execute(request)).andReturn(new HttpResponse(\"malformed junk\")).once();\n        replay(pipeline);\n        try {\n            specFactory.getGadgetSpec(createContext(SPEC_URL, false));\n            fail(\"No exception thrown on bad parse\");\n        } catch (GadgetException e) {\n        }\n        specFactory.getGadgetSpec(createContext(SPEC_URL, false));\n    }\n", 
    "func2": "    public InputStream getInputStream() throws TGBrowserException {\n        try {\n            if (!this.isFolder()) {\n                URL url = new URL(this.url);\n                InputStream stream = url.openStream();\n                return stream;\n            }\n        } catch (Throwable throwable) {\n            throw new TGBrowserException(throwable);\n        }\n        return null;\n    }\n", 
    "id": 0, 
    "id1": 2381663, 
    "id2": 4458076, 
    "label": false
}

Data Fields

The data fields are explained below for each config. The data fields are the same across all splits.

default

field name	type	description
id	int32	Index of the sample
id1	int32	The first function id
id2	int32	The second function id
func1	string	The full text of the first function
func2	string	The full text of the second function
label	bool	1 (true) if the functions are semantically equivalent, 0 (false) otherwise
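The field table can be mirrored in code as a typed record. The following is a minimal sketch, assuming `label` is true for a clone pair as the dataset summary states; `ClonePair` and `parse_pair` are hypothetical names introduced for this example and are not part of any library, and the function bodies in the sample record are shortened placeholders.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class ClonePair:
    """One record, following the field table above (hypothetical helper)."""
    id: int      # index of the sample
    id1: int     # id of the first function
    id2: int     # id of the second function
    func1: str   # full text of the first function
    func2: str   # full text of the second function
    label: bool  # assumed: True means the two functions are clones


_EXPECTED = {"id": int, "id1": int, "id2": int,
             "func1": str, "func2": str, "label": bool}


def parse_pair(record: Mapping[str, Any]) -> ClonePair:
    """Check that a raw record has the expected fields and types."""
    for name, typ in _EXPECTED.items():
        if name not in record:
            raise KeyError(f"missing field: {name}")
        value = record[name]
        # bool is a subclass of int in Python, so reject it for int fields.
        if not isinstance(value, typ) or (typ is int and isinstance(value, bool)):
            raise TypeError(f"field {name!r} should be {typ.__name__}")
    return ClonePair(**{name: record[name] for name in _EXPECTED})


# Ids and label copied from the "test" example; function text is shortened.
pair = parse_pair({
    "id": 0,
    "id1": 2381663,
    "id2": 4458076,
    "func1": "public void a() {}",
    "func2": "public InputStream b() { return null; }",
    "label": False,
})
print(pair.id1, pair.id2, pair.label)
```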

Data Splits

name	train	validation	test
default	901028	415416	415416
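For a quick sense of how the pairs are distributed, this small sketch sums the split sizes from the table above and prints each split's share. The variable names are chosen for this example only.

```python
# Split sizes copied from the table above.
split_sizes = {"train": 901028, "validation": 415416, "test": 415416}

total = sum(split_sizes.values())  # 1,731,860 pairs in total
fractions = {name: n / total for name, n in split_sizes.items()}

for name, frac in fractions.items():
    print(f"{name:<10} {split_sizes[name]:>8} pairs  {frac:.1%}")
```

The total falls inside the 1M<n<10M size bucket listed in the card metadata, and roughly half of all pairs are in the training split.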

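Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

Data was mined from the IJaDataset 2.0 dataset. [More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

Potential clones were identified automatically using search heuristics and then manually labeled by three judges. [More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]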

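Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

Most of the clones are Type 1 and Type 2; Type 3 clones, and especially Type 4 clones, are rare.

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

https://github.com/microsoft, https://github.com/madlag

Licensing Information

Computational Use of Data Agreement (C-UDA) License.

Citation Information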

@inproceedings{svajlenko2014towards,
  title={Towards a big data curated benchmark of inter-project code clones},
  author={Svajlenko, Jeffrey and Islam, Judith F and Keivanloo, Iman and Roy, Chanchal K and Mia, Mohammad Mamun},
  booktitle={2014 IEEE International Conference on Software Maintenance and Evolution},
  pages={476--480},
  year={2014},
  organization={IEEE}
}

@inproceedings{wang2020detecting,
  title={Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree},
  author={Wang, Wenhan and Li, Ge and Ma, Bo and Xia, Xin and Jin, Zhi},
  booktitle={2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
  pages={261--271},
  year={2020},
  organization={IEEE}
}

Contributions

Thanks to @madlag (and partly also @ncoop57) for adding this dataset.

Author:

Anonymous

Dataset size:

17.82 KB