Dataset:

code_x_glue_tc_nl_code_search_adv

Tasks:

text-retrieval

Sub-tasks:

document-retrieval

Languages:

code

Multilinguality:

other-programming-languages

Size:

100K<n<1M

Language Creators:

found

Annotations Creators:

found

Source Datasets:

original

License:

c-uda

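Dataset Card for "code_x_glue_tc_nl_code_search_adv"

Dataset Summary

The CodeXGLUE NL-code-search-Adv dataset is available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv.

The dataset is derived from CodeSearchNet and filtered as follows:

Remove examples whose code cannot be parsed into an abstract syntax tree.
Remove examples whose documentation has fewer than 3 or more than 256 tokens.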
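Remove examples whose documentation contains special tokens (e.g. <img ...> or https:...).
Remove examples whose documentation is not in English.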
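
The four filters above can be sketched in Python roughly as follows. This is only an illustration, not the pipeline that was actually used to build the dataset: the function name `keep_example`, the whitespace tokenization, the regular expression for "special tokens", and the ASCII check standing in for English detection are all assumptions. It also parses with the `ast` module of the running Python 3 interpreter, which may reject code that a different parser would accept.

```python
import ast
import re

# Assumption: "special tokens" are approximated as HTML-like tags and URLs.
SPECIAL_TOKEN_RE = re.compile(r"<[^>]+>|https?:")


def keep_example(code: str, docstring: str) -> bool:
    """Return True if a (code, docstring) pair passes all four filters."""
    # 1. The code must parse into an abstract syntax tree.
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False

    # 2. Documentation length must be between 3 and 256 tokens.
    #    Assumption: tokens are whitespace-separated words.
    n_tokens = len(docstring.split())
    if n_tokens < 3 or n_tokens > 256:
        return False

    # 3. The documentation must not contain special tokens.
    if SPECIAL_TOKEN_RE.search(docstring):
        return False

    # 4. The documentation must be English.
    #    Assumption: a crude ASCII-only check stands in for language detection.
    if not docstring.isascii():
        return False

    return True
```

A real pipeline would likely use a proper language identifier and the dataset's own tokenizer; the checks here only mirror the order and thresholds listed above.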

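Supported Tasks and Leaderboards

document-retrieval: The dataset can be used to train a model that retrieves the top-k code snippets for a given English natural language query.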
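
To make the top-k retrieval setting concrete, here is a toy sketch that ranks code snippets by bag-of-words cosine similarity with the query. It is a deliberately simple lexical baseline and not a method described by this card; a trained model would replace the scoring function. The names `cosine` and `top_k` are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_tokens, corpus, k=1):
    """Return the indices of the k code snippets most similar to the query.

    corpus is a list of token lists, one per candidate code snippet.
    Ties are broken by the lower index.
    """
    query = Counter(t.lower() for t in query_tokens)
    scored = [
        (cosine(query, Counter(t.lower() for t in code_tokens)), index)
        for index, code_tokens in enumerate(corpus)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:k]]
```

For example, with two candidate snippets, the query `["download", "videos", "by", "url"]` shares tokens only with the first one, so `top_k(..., k=1)` selects index 0.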

Languages

Python programming language
English natural language

Dataset Structure

Data Instances

An example of 'validation' looks as follows.

{
    "argument_list": "", 
    "code": "def Func(arg_0, arg_1='.', arg_2=True, arg_3=False, **arg_4):\n    \"\"\"Downloads Dailymotion videos by URL.\n    \"\"\"\n\n    arg_5 = get_content(rebuilt_url(arg_0))\n    arg_6 = json.loads(match1(arg_5, r'qualities\":({.+?}),\"'))\n    arg_7 = match1(arg_5, r'\"video_title\"\\s*:\\s*\"([^\"]+)\"') or \\\n            match1(arg_5, r'\"title\"\\s*:\\s*\"([^\"]+)\"')\n    arg_7 = unicodize(arg_7)\n\n    for arg_8 in ['1080','720','480','380','240','144','auto']:\n        try:\n            arg_9 = arg_6[arg_8][1][\"url\"]\n            if arg_9:\n                break\n        except KeyError:\n            pass\n\n    arg_10, arg_11, arg_12 = url_info(arg_9)\n\n    print_info(site_info, arg_7, arg_10, arg_12)\n    if not arg_3:\n        download_urls([arg_9], arg_7, arg_11, arg_12, arg_1=arg_1, arg_2=arg_2)", 
    "code_tokens": ["def", "Func", "(", "arg_0", ",", "arg_1", "=", "'.'", ",", "arg_2", "=", "True", ",", "arg_3", "=", "False", ",", "**", "arg_4", ")", ":", "arg_5", "=", "get_content", "(", "rebuilt_url", "(", "arg_0", ")", ")", "arg_6", "=", "json", ".", "loads", "(", "match1", "(", "arg_5", ",", "r'qualities\":({.+?}),\"'", ")", ")", "arg_7", "=", "match1", "(", "arg_5", ",", "r'\"video_title\"\\s*:\\s*\"([^\"]+)\"'", ")", "or", "match1", "(", "arg_5", ",", "r'\"title\"\\s*:\\s*\"([^\"]+)\"'", ")", "arg_7", "=", "unicodize", "(", "arg_7", ")", "for", "arg_8", "in", "[", "'1080'", ",", "'720'", ",", "'480'", ",", "'380'", ",", "'240'", ",", "'144'", ",", "'auto'", "]", ":", "try", ":", "arg_9", "=", "arg_6", "[", "arg_8", "]", "[", "1", "]", "[", "\"url\"", "]", "if", "arg_9", ":", "break", "except", "KeyError", ":", "pass", "arg_10", ",", "arg_11", ",", "arg_12", "=", "url_info", "(", "arg_9", ")", "print_info", "(", "site_info", ",", "arg_7", ",", "arg_10", ",", "arg_12", ")", "if", "not", "arg_3", ":", "download_urls", "(", "[", "arg_9", "]", ",", "arg_7", ",", "arg_11", ",", "arg_12", ",", "arg_1", "=", "arg_1", ",", "arg_2", "=", "arg_2", ")"], 
    "docstring": "Downloads Dailymotion videos by URL.", 
    "docstring_summary": "Downloads Dailymotion videos by URL.", 
    "docstring_tokens": ["Downloads", "Dailymotion", "videos", "by", "URL", "."], 
    "func_name": "", 
    "id": 0, 
    "identifier": "dailymotion_download", 
    "language": "python", 
    "nwo": "soimort/you-get", 
    "original_string": "", 
    "parameters": "(url, output_dir='.', merge=True, info_only=False, **kwargs)", 
    "path": "src/you_get/extractors/dailymotion.py", 
    "repo": "", 
    "return_statement": "", 
    "score": 0.9997601509094238, 
    "sha": "b746ac01c9f39de94cac2d56f665285b0523b974", 
    "url": "https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/dailymotion.py#L13-L35"
}
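
In the instance above, `func_name` and `repo` are empty strings, while `identifier` and `nwo` carry the function name and the owner/repository. A small helper that falls back from one field to the other might look like the sketch below. Whether this pattern holds for every record is an assumption, and `describe` is a hypothetical name.

```python
def describe(record: dict) -> str:
    """Build a one-line summary of a record: "<repo>:<function> | <query>"."""
    # Fall back to `identifier` / `nwo` when `func_name` / `repo` are empty,
    # as they are in the validation instance shown above.
    name = record.get("func_name") or record.get("identifier") or "<unknown>"
    repo = record.get("repo") or record.get("nwo") or "<unknown>"
    # The natural language query is rebuilt from the docstring tokens.
    query = " ".join(record.get("docstring_tokens", []))
    return f"{repo}:{name} | {query}"
```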

Data Fields

The data fields are explained below for each config. The data fields are the same across all splits.

default

field name	type	description
id	int32	Index of the sample
repo	string	repo: the owner/repo
path	string	path: the full path to the original file
func_name	string	func_name: the function or method name
original_string	string	original_string: the raw string before tokenization or parsing
language	string	language: the programming language
code	string	code/function: the part of the original_string that is code
code_tokens	Sequence[string]	code_tokens/function_tokens: tokenized version of code
docstring	string	docstring: the top-level comment or docstring, if it exists in the original string
docstring_tokens	Sequence[string]	docstring_tokens: tokenized version of docstring
sha	string	sha of the file
url	string	url of the file
docstring_summary	string	Summary of the docstring
parameters	string	parameters of the function
return_statement	string	return statement
argument_list	string	list of arguments of the function
identifier	string	identifier
nwo	string	nwo
score	datasets.Value("float")	score for this search
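
The field table can be turned into a lightweight schema check. The sketch below maps each field name to the Python type it is assumed to correspond to (`int32` to `int`, `Sequence[string]` to `list`, and so on) and reports any mismatch. `EXPECTED_FIELDS` and `schema_problems` are hypothetical names.

```python
# Assumption: Python types that correspond to the types listed in the table.
EXPECTED_FIELDS = {
    "id": int,
    "repo": str,
    "path": str,
    "func_name": str,
    "original_string": str,
    "language": str,
    "code": str,
    "code_tokens": list,
    "docstring": str,
    "docstring_tokens": list,
    "sha": str,
    "url": str,
    "docstring_summary": str,
    "parameters": str,
    "return_statement": str,
    "argument_list": str,
    "identifier": str,
    "nwo": str,
    "score": float,
}


def schema_problems(record: dict) -> list:
    """Return human-readable problems; an empty list means the record fits."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems
```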

Data Splits

name	train	validation	test
default	251820	9604	19210
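
As a quick sanity check on the split sizes, the following sketch computes the total number of examples and each split's share. The counts are copied from the table; `split_fractions` is a hypothetical helper name.

```python
# Split sizes copied from the table above.
SPLIT_SIZES = {"train": 251820, "validation": 9604, "test": 19210}


def split_fractions(sizes: dict) -> dict:
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


if __name__ == "__main__":
    print("total examples:", sum(SPLIT_SIZES.values()))
    for name, fraction in split_fractions(SPLIT_SIZES).items():
        print(f"{name}: {fraction:.1%}")
```

The training split accounts for roughly 90% of the 280,634 examples, with the remainder divided between validation and test.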

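Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

Data from the CodeSearchNet Challenge dataset. [More Information Needed]

Who are the source language producers?

Software engineering developers.

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

https://github.com/microsoft , https://github.com/madlag

Licensing Information

Computational Use of Data Agreement (C-UDA) License.

Citation Information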

@article{husain2019codesearchnet,
  title={Codesearchnet challenge: Evaluating the state of semantic code search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}

Contributions

Thanks to @madlag (and partly also @ncoop57) for adding this dataset.

Author:

Anonymous

Dataset size:

25.87 KB