Dataset:

code_x_glue_cc_clone_detection_poj104

Task:

Text retrieval

Subtask:

document-retrieval

Language:

code

Multilinguality:

monolingual

Size:

10K<n<100K

Language creators:

found

Annotation creators:

found

Source datasets:

original

License:

c-uda

Dataset Card for "code_x_glue_cc_clone_detection_poj_104"

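Dataset Summary

The CodeXGLUE Clone-detection-POJ-104 dataset is available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-POJ-104.

Given a piece of code and a set of candidates as input, the task is to return the top K code samples that have the same semantics as the query. Models are evaluated by mean average precision (MAP). This task uses the POJ-104 dataset.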
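As a rough illustration of the metric, the sketch below computes a plain MAP over ranked candidate lists. It assumes a candidate counts as relevant when its label equals the query's label, and it normalizes by the number of relevant items that appear in the ranking. The function names and the toy data are hypothetical, and the official CodeXGLUE evaluator may truncate or normalize the ranking differently.

```python
from typing import Hashable, List, Sequence, Tuple


def average_precision(query_label: Hashable, ranked_labels: Sequence[Hashable]) -> float:
    """Average precision for one query, given the labels of its ranked candidates."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    # Assumption: normalize by the relevant items found in the ranking.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries: List[Tuple[Hashable, Sequence[Hashable]]]) -> float:
    """Mean of the per-query average precision over (query_label, ranked_labels) pairs."""
    if not queries:
        return 0.0
    return sum(average_precision(q, r) for q, r in queries) / len(queries)


# Hypothetical toy rankings: each entry is (query label, labels of ranked candidates).
toy_queries = [
    ("p1", ["p1", "p2", "p1"]),
    ("p2", ["p1", "p2", "p3"]),
]
map_score = mean_average_precision(toy_queries)
print(f"MAP = {map_score:.4f}")
```

On the toy input, the first query scores (1/1 + 2/3) / 2 and the second scores 1/2, so the mean is 2/3.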

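Supported Tasks and Leaderboards

document-retrieval: the dataset can be used to train a model that retrieves the top K code samples with the same semantics as a query.

Languages

C++ programming language

Dataset Structure

Data Instances

An example from 'train' looks as follows.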

{
    "code": "\nint f(int shu,int min)\n{ \n  int k=1;\n  if(shu < min)\n  { \n    k= 0; \n   return k;\n  } \n  else\n {\n  for(int i = min;i<shu;i++)\n  { \n    if(shu%i == 0)\n    { \n         k=k+ f(shu/i,i); \n    } \n  \n    \n  } \n    return k; \n}\n} \n\nmain()\n{\n      int n,i,a;\n      scanf(\"%d\",&n);\n      \n      for(i=0;i<n;i++)\n      {\n          scanf(\"%d\",&a);\n          \n          if(i!=n-1)                                                        \n           printf(\"%d\\n\",f(a,2));\n           else\n           printf(\"%d\",f(a,2));                           \n                                      \n                     \n                      \n      }              \n                     \n                      \n                      }", 
    "id": 0, 
    "label": "home"
}

Data Fields

Each data field is explained below for each config. The data fields are the same across all splits.

default

field name	type	description
id	int32	Index of the sample
code	string	The full text of the function
label	string	The id of the problem that the source code solves
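Because `label` identifies the problem a program solves, one way to work with these fields is to treat samples that share a `label` as semantic clones of each other. The sketch below groups a few hypothetical records (invented for illustration, not taken from the dataset) by `label` and looks up the clones of one sample. The helper names are assumptions.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical records that mirror the three fields in the table above.
records = [
    {"id": 0, "code": "int main(){return 0;}", "label": "1"},
    {"id": 1, "code": "int main(){int a=0;return a;}", "label": "1"},
    {"id": 2, "code": "int main(){return 1;}", "label": "2"},
]


def group_ids_by_label(rows: List[dict]) -> Dict[str, List[int]]:
    """Map each problem label to the ids of the samples that solve it."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for row in rows:
        groups[row["label"]].append(row["id"])
    return dict(groups)


def clones_of(row: dict, groups: Dict[str, List[int]]) -> List[int]:
    """Ids of the other samples that share the row's label."""
    return [i for i in groups[row["label"]] if i != row["id"]]


groups = group_ids_by_label(records)
print(clones_of(records[0], groups))  # [1]
```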

Data Splits

name	train	validation	test
default	32000	8000	12000
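For a quick sense of proportions, the short sketch below derives the total sample count and each split's share from the numbers in the table. The variable names are illustrative only.

```python
# Split sizes copied from the table above.
split_sizes = {"train": 32000, "validation": 8000, "test": 12000}

total = sum(split_sizes.values())
shares = {name: size / total for name, size in split_sizes.items()}

for name, share in shares.items():
    print(f"{name}: {split_sizes[name]} samples ({share:.1%})")
print(f"total: {total} samples")
```

The three splits add up to 52,000 samples, with roughly 61.5% in train, 15.4% in validation, and 23.1% in test.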

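Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]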

Additional Information

Dataset Curators

https://github.com/microsoft, https://github.com/madlag

Licensing Information

Computational Use of Data Agreement (C-UDA) License.

Citation Information

@inproceedings{mou2016convolutional,
  title={Convolutional neural networks over tree structures for programming language processing},
  author={Mou, Lili and Li, Ge and Zhang, Lu and Wang, Tao and Jin, Zhi},
  booktitle={Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence},
  pages={1287--1293},
  year={2016}
}

Contributions

Thanks to @madlag (and partly also @ncoop57) for adding this dataset.

Author:

Anonymous

Dataset size:

14.71 KB