Dataset:

code_x_glue_tc_nl_code_search_adv

Languages:

code en

Size:

100K<n<1M

Language creators:

found

Annotations creators:

found

Source datasets:

original

License:

c-uda

Dataset Card for "code_x_glue_tc_nl_code_search_adv"

Dataset Summary

The CodeXGLUE NL-code-search-Adv dataset is available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv

The dataset is derived from CodeSearchNet and filtered as follows:

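  • Remove examples whose code cannot be parsed into an abstract syntax tree.
  • Remove examples whose documentation has fewer than 3 or more than 256 tokens.
  • Remove examples whose documentation contains special tokens (e.g. <img ...> or https:...).
  • Remove examples whose documentation is not in English.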
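
The filtering rules above can be sketched in Python as follows. This is only an illustration, not the script used to build the dataset: the function names are hypothetical, the special-token pattern (HTML-like tags and `http(s):` links) is a guess based on the examples in the list, and the ASCII check is a crude stand-in for whatever language detection was actually applied.

```python
import ast
import re
from typing import Sequence

# Assumed pattern: HTML-like tags or http(s) links count as "special tokens".
SPECIAL_TOKEN_RE = re.compile(r"<[a-zA-Z/][^>]*>|https?:")


def parses_to_ast(code: str) -> bool:
    """Return True if the Python source can be parsed into an abstract syntax tree."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False
    return True


def keep_example(code: str, docstring_tokens: Sequence[str]) -> bool:
    """Apply the four filters in the order they are listed."""
    # 1. The code must parse.
    if not parses_to_ast(code):
        return False
    # 2. The documentation must have between 3 and 256 tokens (inclusive).
    if len(docstring_tokens) < 3 or len(docstring_tokens) > 256:
        return False
    # 3. The documentation must not contain special tokens.
    doc = " ".join(docstring_tokens)
    if SPECIAL_TOKEN_RE.search(doc):
        return False
    # 4. Heuristic (assumption): treat any non-ASCII documentation as non-English.
    if not doc.isascii():
        return False
    return True
```

A real pipeline would likely use a proper language identifier for the last step, since English text can contain non-ASCII characters and non-English text can be pure ASCII.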

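Supported Tasks and Leaderboards

  • document-retrieval: The dataset can be used to train a model that retrieves the top-k code snippets for a given English natural language query.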
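
To make the top-k retrieval task concrete, here is a toy lexical baseline that ranks code snippets by Jaccard overlap with the query. It is not the CodeXGLUE baseline; `tokenize` and `top_k_codes` are hypothetical helpers written for this sketch. A trained model would replace the overlap score with a learned similarity.

```python
import re
from typing import List, Sequence, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase and split on non-alphanumeric characters (this also splits snake_case)."""
    return {tok for tok in re.split(r"[^A-Za-z0-9]+", text.lower()) if tok}


def top_k_codes(query: str, codes: Sequence[str], k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k codes with the highest Jaccard overlap."""
    query_tokens = tokenize(query)
    scored = []
    for idx, code in enumerate(codes):
        code_tokens = tokenize(code)
        union = query_tokens | code_tokens
        score = len(query_tokens & code_tokens) / len(union) if union else 0.0
        scored.append((idx, score))
    # Stable sort by descending score: ties keep their original index order.
    scored.sort(key=lambda pair: -pair[1])
    return scored[:k]


codes = [
    "def Func(arg_0):\n    return json.loads(arg_0)",
    "def Func(arg_0):\n    download_urls([arg_0])",
    "def Func(arg_0, arg_1):\n    return arg_0 + arg_1",
]
print(top_k_codes("download videos from urls", codes, k=2))
```

In the validation example shown in this card, function and argument names appear as `Func` and `arg_i`, so lexical overlap of this kind is a weak signal and serves only to illustrate the input and output of the task.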

Languages

  • Python programming language
  • English natural language

Dataset Structure

Data Instances

An example from the 'validation' split looks as follows.

{
    "argument_list": "", 
    "code": "def Func(arg_0, arg_1='.', arg_2=True, arg_3=False, **arg_4):\n    \"\"\"Downloads Dailymotion videos by URL.\n    \"\"\"\n\n    arg_5 = get_content(rebuilt_url(arg_0))\n    arg_6 = json.loads(match1(arg_5, r'qualities\":({.+?}),\"'))\n    arg_7 = match1(arg_5, r'\"video_title\"\\s*:\\s*\"([^\"]+)\"') or \\\n            match1(arg_5, r'\"title\"\\s*:\\s*\"([^\"]+)\"')\n    arg_7 = unicodize(arg_7)\n\n    for arg_8 in ['1080','720','480','380','240','144','auto']:\n        try:\n            arg_9 = arg_6[arg_8][1][\"url\"]\n            if arg_9:\n                break\n        except KeyError:\n            pass\n\n    arg_10, arg_11, arg_12 = url_info(arg_9)\n\n    print_info(site_info, arg_7, arg_10, arg_12)\n    if not arg_3:\n        download_urls([arg_9], arg_7, arg_11, arg_12, arg_1=arg_1, arg_2=arg_2)", 
    "code_tokens": ["def", "Func", "(", "arg_0", ",", "arg_1", "=", "'.'", ",", "arg_2", "=", "True", ",", "arg_3", "=", "False", ",", "**", "arg_4", ")", ":", "arg_5", "=", "get_content", "(", "rebuilt_url", "(", "arg_0", ")", ")", "arg_6", "=", "json", ".", "loads", "(", "match1", "(", "arg_5", ",", "r'qualities\":({.+?}),\"'", ")", ")", "arg_7", "=", "match1", "(", "arg_5", ",", "r'\"video_title\"\\s*:\\s*\"([^\"]+)\"'", ")", "or", "match1", "(", "arg_5", ",", "r'\"title\"\\s*:\\s*\"([^\"]+)\"'", ")", "arg_7", "=", "unicodize", "(", "arg_7", ")", "for", "arg_8", "in", "[", "'1080'", ",", "'720'", ",", "'480'", ",", "'380'", ",", "'240'", ",", "'144'", ",", "'auto'", "]", ":", "try", ":", "arg_9", "=", "arg_6", "[", "arg_8", "]", "[", "1", "]", "[", "\"url\"", "]", "if", "arg_9", ":", "break", "except", "KeyError", ":", "pass", "arg_10", ",", "arg_11", ",", "arg_12", "=", "url_info", "(", "arg_9", ")", "print_info", "(", "site_info", ",", "arg_7", ",", "arg_10", ",", "arg_12", ")", "if", "not", "arg_3", ":", "download_urls", "(", "[", "arg_9", "]", ",", "arg_7", ",", "arg_11", ",", "arg_12", ",", "arg_1", "=", "arg_1", ",", "arg_2", "=", "arg_2", ")"], 
    "docstring": "Downloads Dailymotion videos by URL.", 
    "docstring_summary": "Downloads Dailymotion videos by URL.", 
    "docstring_tokens": ["Downloads", "Dailymotion", "videos", "by", "URL", "."], 
    "func_name": "", 
    "id": 0, 
    "identifier": "dailymotion_download", 
    "language": "python", 
    "nwo": "soimort/you-get", 
    "original_string": "", 
    "parameters": "(url, output_dir='.', merge=True, info_only=False, **kwargs)", 
    "path": "src/you_get/extractors/dailymotion.py", 
    "repo": "", 
    "return_statement": "", 
    "score": 0.9997601509094238, 
    "sha": "b746ac01c9f39de94cac2d56f665285b0523b974", 
    "url": "https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/dailymotion.py#L13-L35"
}
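
The `url` field of the record above encodes the repository, commit, file path and line range of the function. A minimal sketch for splitting it back into those parts is shown below; `parse_source_url` is a hypothetical helper, and the URL layout is inferred from this single example, so other shapes (for instance a single-line `#L13` anchor) are rejected rather than handled.

```python
import re
from typing import Dict, Union

# Layout assumed from the example record:
# https://github.com/<owner>/<repo>/blob/<40-hex sha>/<path>#L<start>-L<end>
URL_RE = re.compile(
    r"^https://github\.com/(?P<nwo>[^/]+/[^/]+)/blob/(?P<sha>[0-9a-f]{40})/"
    r"(?P<path>[^#]+)#L(?P<start>\d+)-L(?P<end>\d+)$"
)


def parse_source_url(url: str) -> Dict[str, Union[str, int]]:
    """Split a record's `url` into nwo, sha, path and an inclusive line range."""
    match = URL_RE.match(url)
    if match is None:
        raise ValueError(f"unrecognized url layout: {url!r}")
    return {
        "nwo": match.group("nwo"),
        "sha": match.group("sha"),
        "path": match.group("path"),
        "start": int(match.group("start")),
        "end": int(match.group("end")),
    }
```

For the record above, the parsed `nwo`, `sha` and `path` values can be compared against the fields of the same name as a quick consistency check.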

Data Fields

The data fields are explained below for each config. The data fields are the same across all splits.

default
field name         type                      description
id                 int32                     Index of the sample
repo               string                    repo: the owner/repo
path               string                    path: the full path to the original file
func_name          string                    func_name: the function or method name
original_string    string                    original_string: the raw string before tokenization or parsing
language           string                    language: the programming language
code               string                    code/function: the part of the original_string that is code
code_tokens        Sequence[string]          code_tokens/function_tokens: tokenized version of code
docstring          string                    docstring: the top-level comment or docstring, if it exists in the original string
docstring_tokens   Sequence[string]          docstring_tokens: tokenized version of docstring
sha                string                    sha of the file
url                string                    url of the file
docstring_summary  string                    Summary of the docstring
parameters         string                    parameters of the function
return_statement   string                    return statement
argument_list      string                    list of arguments of the function
identifier         string                    identifier (in the example above, the original function name: dailymotion_download)
nwo                string                    nwo (in the example above, the owner/repo pair: soimort/you-get)
score              datasets.Value("float")   score for this search
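
As a quick sanity check on records, the field table can be mirrored as a mapping from field name to Python type. This is a sketch under assumptions: `EXPECTED_FIELDS` and `validate_record` are hypothetical names, `int32` and `float` are approximated by Python's `int` and `float`, and `Sequence[string]` is checked as a list of `str`.

```python
from typing import Any, Dict, List

# Field name -> Python type, following the order of the table.
EXPECTED_FIELDS: Dict[str, type] = {
    "id": int,
    "repo": str,
    "path": str,
    "func_name": str,
    "original_string": str,
    "language": str,
    "code": str,
    "code_tokens": list,
    "docstring": str,
    "docstring_tokens": list,
    "sha": str,
    "url": str,
    "docstring_summary": str,
    "parameters": str,
    "return_statement": str,
    "argument_list": str,
    "identifier": str,
    "nwo": str,
    "score": float,
}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record matches the table."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    # Token fields must contain only strings.
    for name in ("code_tokens", "docstring_tokens"):
        tokens = record.get(name)
        if isinstance(tokens, list) and not all(isinstance(t, str) for t in tokens):
            problems.append(f"{name}: all tokens must be str")
    # Report fields that the table does not list.
    for name in sorted(set(record) - set(EXPECTED_FIELDS)):
        problems.append(f"unexpected field: {name}")
    return problems
```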

Data Splits

name      train    validation   test
default   251820   9604         19210
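
After loading the data, the split sizes can be compared against the counts listed for the default config. The helper below is a hypothetical sketch that accepts any mapping from split name to a sized object, so it does not depend on a particular loading library.

```python
from typing import Dict, Mapping, Sized, Tuple

# Example counts per split for the default config, as listed in this card.
EXPECTED_SPLIT_SIZES = {"train": 251820, "validation": 9604, "test": 19210}


def split_size_mismatches(splits: Mapping[str, Sized]) -> Dict[str, Tuple[int, int]]:
    """Map split name -> (expected, actual) for every split whose size differs."""
    mismatches = {}
    for name, expected in EXPECTED_SPLIT_SIZES.items():
        # A missing split is reported with an actual size of 0.
        actual = len(splits[name]) if name in splits else 0
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches
```

With the Hugging Face `datasets` library, the object returned by `load_dataset("code_x_glue_tc_nl_code_search_adv")` should be usable as the `splits` argument, assuming the identifier on this card still resolves in your environment.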

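Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

Data from the CodeSearchNet Challenge dataset. [More Information Needed]

Who are the source language producers?

Software engineering developers.

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators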

https://github.com/microsoft , https://github.com/madlag

Licensing Information

Computational Use of Data Agreement (C-UDA) License.

Citation Information

@article{husain2019codesearchnet,
  title={Codesearchnet challenge: Evaluating the state of semantic code search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}

Contributions

Thanks to @madlag (and partly also @ncoop57) for adding this dataset.