Dataset: code_x_glue_tc_nl_code_search_adv
Tasks: Text Retrieval
Sub-tasks: document-retrieval
Multilinguality: other-programming-languages
Size: 100K<n<1M
Language creators: found
Annotations creators: found
Source datasets: original
License: c-uda

The CodeXGLUE NL-code-search-Adv dataset is available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv.
The dataset we use comes from CodeSearchNet and has been filtered.
An example of 'validation' looks as follows.
{
  "argument_list": "",
  "code": "def Func(arg_0, arg_1='.', arg_2=True, arg_3=False, **arg_4):\n \"\"\"Downloads Dailymotion videos by URL.\n \"\"\"\n\n arg_5 = get_content(rebuilt_url(arg_0))\n arg_6 = json.loads(match1(arg_5, r'qualities\":({.+?}),\"'))\n arg_7 = match1(arg_5, r'\"video_title\"\\s*:\\s*\"([^\"]+)\"') or \\\n match1(arg_5, r'\"title\"\\s*:\\s*\"([^\"]+)\"')\n arg_7 = unicodize(arg_7)\n\n for arg_8 in ['1080','720','480','380','240','144','auto']:\n try:\n arg_9 = arg_6[arg_8][1][\"url\"]\n if arg_9:\n break\n except KeyError:\n pass\n\n arg_10, arg_11, arg_12 = url_info(arg_9)\n\n print_info(site_info, arg_7, arg_10, arg_12)\n if not arg_3:\n download_urls([arg_9], arg_7, arg_11, arg_12, arg_1=arg_1, arg_2=arg_2)",
  "code_tokens": ["def", "Func", "(", "arg_0", ",", "arg_1", "=", "'.'", ",", "arg_2", "=", "True", ",", "arg_3", "=", "False", ",", "**", "arg_4", ")", ":", "arg_5", "=", "get_content", "(", "rebuilt_url", "(", "arg_0", ")", ")", "arg_6", "=", "json", ".", "loads", "(", "match1", "(", "arg_5", ",", "r'qualities\":({.+?}),\"'", ")", ")", "arg_7", "=", "match1", "(", "arg_5", ",", "r'\"video_title\"\\s*:\\s*\"([^\"]+)\"'", ")", "or", "match1", "(", "arg_5", ",", "r'\"title\"\\s*:\\s*\"([^\"]+)\"'", ")", "arg_7", "=", "unicodize", "(", "arg_7", ")", "for", "arg_8", "in", "[", "'1080'", ",", "'720'", ",", "'480'", ",", "'380'", ",", "'240'", ",", "'144'", ",", "'auto'", "]", ":", "try", ":", "arg_9", "=", "arg_6", "[", "arg_8", "]", "[", "1", "]", "[", "\"url\"", "]", "if", "arg_9", ":", "break", "except", "KeyError", ":", "pass", "arg_10", ",", "arg_11", ",", "arg_12", "=", "url_info", "(", "arg_9", ")", "print_info", "(", "site_info", ",", "arg_7", ",", "arg_10", ",", "arg_12", ")", "if", "not", "arg_3", ":", "download_urls", "(", "[", "arg_9", "]", ",", "arg_7", ",", "arg_11", ",", "arg_12", ",", "arg_1", "=", "arg_1", ",", "arg_2", "=", "arg_2", ")"],
  "docstring": "Downloads Dailymotion videos by URL.",
  "docstring_summary": "Downloads Dailymotion videos by URL.",
  "docstring_tokens": ["Downloads", "Dailymotion", "videos", "by", "URL", "."],
  "func_name": "",
  "id": 0,
  "identifier": "dailymotion_download",
  "language": "python",
  "nwo": "soimort/you-get",
  "original_string": "",
  "parameters": "(url, output_dir='.', merge=True, info_only=False, **kwargs)",
  "path": "src/you_get/extractors/dailymotion.py",
  "repo": "",
  "return_statement": "",
  "score": 0.9997601509094238,
  "sha": "b746ac01c9f39de94cac2d56f665285b0523b974",
  "url": "https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/dailymotion.py#L13-L35"
}
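To make the relationships between some of these fields concrete, here is a minimal sketch that works on an abbreviated copy of the sample record. It checks that `url` is built from `nwo`, `sha` and `path`, reads the source line range from the URL fragment, and recovers the original parameter names from `parameters`. The helper names `line_range` and `original_parameter_names` are hypothetical, and the mapping from original names to `arg_0`, `arg_1`, ... is an assumption inferred from this one sample, not a documented guarantee.

```python
import ast
import json
from urllib.parse import urlparse

# Abbreviated copy of the sample record: only the fields used below.
record = json.loads("""
{
  "identifier": "dailymotion_download",
  "nwo": "soimort/you-get",
  "parameters": "(url, output_dir='.', merge=True, info_only=False, **kwargs)",
  "path": "src/you_get/extractors/dailymotion.py",
  "sha": "b746ac01c9f39de94cac2d56f665285b0523b974",
  "url": "https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/dailymotion.py#L13-L35"
}
""")


def line_range(url: str) -> tuple[int, int]:
    """Read the '#L<start>-L<end>' fragment of a GitHub blob URL."""
    start, end = urlparse(url).fragment.split("-")
    return int(start.lstrip("L")), int(end.lstrip("L"))


def original_parameter_names(parameters: str) -> list[str]:
    """Parse a `parameters` string by wrapping it in a dummy function definition."""
    func = ast.parse(f"def _f{parameters}: pass").body[0]
    args = func.args
    names = [a.arg for a in args.posonlyargs + args.args]
    if args.vararg:
        names.append(args.vararg.arg)
    names += [a.arg for a in args.kwonlyargs]
    if args.kwarg:
        names.append(args.kwarg.arg)
    return names


# In this sample, the URL starts with the repository, commit and file path.
expected_prefix = (
    f"https://github.com/{record['nwo']}/blob/{record['sha']}/{record['path']}"
)
url_is_consistent = record["url"].startswith(expected_prefix)
first_line, last_line = line_range(record["url"])

# Assumption: the i-th original parameter was renamed to arg_i in `code`.
renaming = {
    f"arg_{i}": name
    for i, name in enumerate(original_parameter_names(record["parameters"]))
}

print(url_is_consistent, first_line, last_line)
print(renaming)
```

In the sample, `code` calls `download_urls(..., arg_1=arg_1, arg_2=arg_2)`, which fits the recovered mapping of `arg_1` to `output_dir` and `arg_2` to `merge`.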
The data fields are explained below for each config. The data fields are the same across all splits.
default
field name | type | description |
---|---|---|
id | int32 | Index of the sample |
repo | string | repo: the owner/repo |
path | string | path: the full path to the original file |
func_name | string | func_name: the function or method name |
original_string | string | original_string: the raw string before tokenization or parsing |
language | string | language: the programming language |
code | string | code/function: the part of the original_string that is code |
code_tokens | Sequence[string] | code_tokens/function_tokens: tokenized version of code |
docstring | string | docstring: the top-level comment or docstring, if it exists in the original string |
docstring_tokens | Sequence[string] | docstring_tokens: tokenized version of docstring |
sha | string | sha of the file |
url | string | url of the file |
docstring_summary | string | Summary of the docstring |
parameters | string | parameters of the function |
return_statement | string | return statement |
argument_list | string | list of arguments of the function |
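identifier | string | identifier: the original function name (e.g. dailymotion_download in the sample) |
nwo | string | nwo: the repository in owner/name form (e.g. soimort/you-get in the sample) |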
score | float | score for this search |
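The field table can be turned into a simple record check. The sketch below transcribes the table into a Python mapping and reports fields that are missing, unexpected, or of the wrong type. The name `schema_problems` and the type mapping (int32 to `int`, string to `str`, Sequence[string] to `list`, the score to `float`) are assumptions made for illustration; this is not part of any official loader.

```python
from typing import Any

# Field name -> expected Python type, transcribed from the table above.
# Assumed mapping: int32 -> int, string -> str, Sequence[string] -> list.
SCHEMA: dict[str, type] = {
    "id": int,
    "repo": str,
    "path": str,
    "func_name": str,
    "original_string": str,
    "language": str,
    "code": str,
    "code_tokens": list,
    "docstring": str,
    "docstring_tokens": list,
    "sha": str,
    "url": str,
    "docstring_summary": str,
    "parameters": str,
    "return_statement": str,
    "argument_list": str,
    "identifier": str,
    "nwo": str,
    "score": float,
}


def schema_problems(record: dict[str, Any]) -> list[str]:
    """Return human-readable schema violations (an empty list means none found)."""
    problems = []
    for name, expected in SCHEMA.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            got = type(record[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {got}")
    # Token fields should hold strings only.
    for name in ("code_tokens", "docstring_tokens"):
        tokens = record.get(name)
        if isinstance(tokens, list) and not all(isinstance(t, str) for t in tokens):
            problems.append(f"{name}: every token should be a string")
    problems += [f"unexpected field: {name}" for name in record if name not in SCHEMA]
    return problems


# A placeholder record with an empty value of the right type for every field.
example = {name: ([] if kind is list else kind()) for name, kind in SCHEMA.items()}
print(schema_problems(example))

# The same record with the score stored as text instead of a number.
broken = dict(example, score="0.99")
print(schema_problems(broken))
```

A check like this is mainly useful when records are read from raw JSON files, where nothing else enforces the field types.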
name | train | validation | test |
---|---|---|---|
default | 251820 | 9604 | 19210 |
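As a quick arithmetic check on the split sizes, the sketch below computes the total number of examples and the share held by each split, using only the numbers from the table above.

```python
# Split sizes for the "default" config, copied from the table above.
split_sizes = {"train": 251820, "validation": 9604, "test": 19210}

total = sum(split_sizes.values())
shares = {name: size / total for name, size in split_sizes.items()}

print(f"total examples: {total}")
for name, share in shares.items():
    print(f"{name:<10} {split_sizes[name]:>7} {share:.1%}")
```

The training split holds roughly 90% of the examples, and the test split is about twice the size of the validation split.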
Data from the CodeSearchNet Challenge dataset. [More Information Needed]
Who are the source language producers? Software engineering developers.
Who are the annotators? [More Information Needed]
https://github.com/microsoft, https://github.com/madlag
Computational Use of Data Agreement (C-UDA) License.
@article{husain2019codesearchnet,
  title={Codesearchnet challenge: Evaluating the state of semantic code search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}
Thanks to @madlag (and partly also @ncoop57) for adding this dataset.