数据集:

code_x_glue_cc_cloze_testing_all

任务:

文本生成

填充掩码

子任务:

slot-filling

语言:

code

计算机处理:

monolingual

大小:

10K<n<100K 1K<n<10K

语言创建人:

found

批注创建人:

found

源数据集:

original

许可:

c-uda

数据集介绍文件清单

英文

"code_x_glue_cc_cloze_testing_all" 数据集卡片

数据集概述

CodeXGLUE ClozeTesting-all 数据集，位于 https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/ClozeTesting-all

Cloze 测试在自然语言处理中被广泛使用以评估训练语言模型的性能。该任务旨在使用空白处的上下文预测答案，可以将其形式化为多选分类问题。在这里，我们提供了两个代码领域的 Cloze 测试数据集，使用了六种不同的编程语言：ClozeTest-maxmin 和 ClozeTest-all。数据集中的每个实例都包含一个带有掩码的代码函数、其文档字符串和目标词语。ClozeTest-maxmin 和 ClozeTest-all 之间唯一的区别在于它们的选定词语集合，ClozeTest-maxmin 只包含两个词，而 ClozeTest-all 包含 930 个词。

支持的任务和排行榜

slot-filling：该数据集可用于训练模型，以预测代码片段中缺失的标记，类似于 Cloze 测试。

语言

Go编程语言
Java编程语言
JavaScript编程语言
PHP编程语言
Python编程语言
Ruby编程语言

数据集结构

数据实例

‘train’的一个示例如下所示。

{
    "id": 0, 
    "idx": "all-1", 
    "nl_tokens": ["MarshalJSON", "supports", "json", ".", "Marshaler", "interface"], 
    "pl_tokens": ["func", "(", "v", "ContextRealtimeData", ")", "MarshalJSON", "(", ")", "(", "[", "]", "byte", ",", "error", ")", "{", "w", ":=", "jwriter", ".", "<mask>", "{", "}", "\n", "easyjsonC5a4559bEncodeGithubComChromedpCdprotoWebaudio7", "(", "&", "w", ",", "v", ")", "\n", "return", "w", ".", "Buffer", ".", "BuildBytes", "(", ")", ",", "w", ".", "Error", "\n", "}"]
}

java

‘train’的一个示例如下所示。

{
    "id": 0, 
    "idx": "all-1", 
    "nl_tokens": ["/", "*", "(", "non", "-", "Javadoc", ")"], 
    "pl_tokens": ["@", "Override", "public", "int", "peekBit", "(", ")", "throws", "AACException", "{", "int", "ret", ";", "if", "(", "bitsCached", ">", "0", ")", "{", "ret", "=", "(", "cache", ">>", "(", "bitsCached", "-", "1", ")", ")", "&", "1", ";", "}", "else", "{", "final", "int", "word", "=", "readCache", "(", "true", ")", ";", "ret", "=", "(", "<mask>", ">>", "WORD_BITS", "-", "1", ")", "&", "1", ";", "}", "return", "ret", ";", "}"]
}

javascript

‘train’的一个示例如下所示。

{
    "id": 0, 
    "idx": "all-1", 
    "nl_tokens": ["Cast", "query", "params", "according", "to", "type"], 
    "pl_tokens": ["function", "castQueryParams", "(", "relId", ",", "data", ",", "{", "relationships", "}", ")", "{", "const", "relationship", "=", "relationships", "[", "relId", "]", "if", "(", "!", "relationship", ".", "query", ")", "{", "return", "{", "}", "}", "return", "Object", ".", "keys", "(", "relationship", ".", "query", ")", ".", "reduce", "(", "(", "params", ",", "<mask>", ")", "=>", "{", "const", "value", "=", "getField", "(", "data", ",", "relationship", ".", "query", "[", "key", "]", ")", "if", "(", "value", "===", "undefined", ")", "{", "throw", "new", "TypeError", "(", "'Missing value for query param'", ")", "}", "return", "{", "...", "params", ",", "[", "key", "]", ":", "value", "}", "}", ",", "{", "}", ")", "}"]
}

php

‘train’的一个示例如下所示。

{
    "id": 0, 
    "idx": "all-1", 
    "nl_tokens": ["Get", "choices", "."], 
    "pl_tokens": ["protected", "<mask>", "getChoices", "(", "FormFieldTranslation", "$", "translation", ")", "{", "$", "choices", "=", "preg_split", "(", "'/\\r\\n|\\r|\\n/'", ",", "$", "translation", "->", "getOption", "(", "'choices'", ")", ",", "-", "1", ",", "PREG_SPLIT_NO_EMPTY", ")", ";", "return", "array_combine", "(", "$", "choices", ",", "$", "choices", ")", ";", "}"]
}

python

‘train’的一个示例如下所示。

{
    "id": 0, 
    "idx": "all-1", 
    "nl_tokens": ["Post", "a", "review"], 
    "pl_tokens": ["def", "post_review", "(", "session", ",", "review", ")", ":", "# POST /api/projects/0.1/reviews/", "<mask>", "=", "make_post_request", "(", "session", ",", "'reviews'", ",", "json_data", "=", "review", ")", "json_data", "=", "response", ".", "json", "(", ")", "if", "response", ".", "status_code", "==", "200", ":", "return", "json_data", "[", "'status'", "]", "else", ":", "raise", "ReviewNotPostedException", "(", "message", "=", "json_data", "[", "'message'", "]", ",", "error_code", "=", "json_data", "[", "'error_code'", "]", ",", "request_id", "=", "json_data", "[", "'request_id'", "]", ")"]
}

ruby

‘train’的一个示例如下所示。

{
    "id": 0, 
    "idx": "all-1", 
    "nl_tokens": ["By", "default", "taskers", "don", "t", "see", "the", "flor", "variables", "in", "the", "execution", ".", "If", "include_vars", "or", "exclude_vars", "is", "present", "in", "the", "configuration", "of", "the", "tasker", "some", "or", "all", "of", "the", "variables", "are", "passed", "."], 
    "pl_tokens": ["def", "gather_vars", "(", "executor", ",", "tconf", ",", "message", ")", "# try to return before a potentially costly call to executor.vars(nid)", "return", "nil", "if", "(", "tconf", ".", "keys", "&", "%w[", "include_vars", "exclude_vars", "]", ")", ".", "empty?", "# default behaviour, don't pass variables to taskers", "iv", "=", "expand_filter", "(", "tconf", "[", "'include_vars'", "]", ")", "return", "nil", "if", "iv", "==", "false", "ev", "=", "expand_filter", "(", "tconf", "[", "'exclude_vars'", "]", ")", "return", "{", "}", "if", "ev", "==", "true", "vars", "=", "executor", ".", "vars", "(", "message", "[", "'nid'", "]", ")", "return", "vars", "if", "iv", "==", "true", "vars", "=", "vars", ".", "select", "{", "|", "k", ",", "v", "|", "var_match", "(", "k", ",", "iv", ")", "}", "if", "<mask>", "vars", "=", "vars", ".", "reject", "{", "|", "k", ",", "v", "|", "var_match", "(", "k", ",", "ev", ")", "}", "if", "ev", "vars", "end"]
}

数据字段

在下面，我将解释 go 的每个配置中的每个数据字段。数据字段在所有拆分中是相同的。

go、java、javascript、php、python、ruby

field name	type	description
id	int32	Index of the sample
idx	string	Original index in the dataset
nl_tokens	Sequence[string]	Natural language tokens
pl_tokens	Sequence[string]	Programming language tokens

数据拆分

name	train
go	25282
java	40492
javascript	13837
php	51930
python	40137
ruby	4437

数据集创建

策划理由

[需要更多信息]

源数据

初始数据收集和规范化

来自 CodeSearchNet Challenge 数据集的数据。[需要更多信息]

谁是源语言生产者？

软件工程开发人员。

注释

注释过程

[需要更多信息]

谁是注释者？

[需要更多信息]

个人和敏感信息

[需要更多信息]

使用数据的注意事项

数据的社会影响

[需要更多信息]

偏见讨论

[需要更多信息]

其他已知限制

[需要更多信息]

附加信息

数据集策划者

https://github.com/microsoft ， https://github.com/madlag

许可信息

数据计算使用协议（C-UDA）许可证。

引用信息

@article{CodeXGLUE,
  title={CodeXGLUE: An Open Challenge for Code Intelligence},
  journal={arXiv},
  year={2020},
}
@article{feng2020codebert,
  title={CodeBERT: A Pre-Trained Model for Programming and Natural Languages},
  author={Feng, Zhangyin and Guo, Daya and Tang, Duyu and Duan, Nan and Feng, Xiaocheng and Gong, Ming and Shou, Linjun and Qin, Bing and Liu, Ting and Jiang, Daxin and others},
  journal={arXiv preprint arXiv:2002.08155},
  year={2020}
}
@article{husain2019codesearchnet,
  title={CodeSearchNet Challenge: Evaluating the State of Semantic Code Search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}

贡献

感谢 @madlag（部分也感谢 @ncoop57）添加了该数据集。

作者:

佚名

数据集大小:

38.8 KB