Dataset:
Fsoft-AIC/the-vault-function
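The Vault is a comprehensive, large-scale, multilingual parallel dataset of high-quality code-text pairs derived from The Stack, the largest permissively licensed source code dataset.
We provide The Vault, which contains code snippets from 10 popular programming languages: Java, JavaScript, Python, Ruby, Rust, Golang, C#, C++, C, and PHP. The dataset offers multiple code-snippet levels, metadata, and 11 docstring styles to improve usability and versatility.
The Vault can be used for pretraining LLMs or for downstream code-text interaction tasks. Several tasks related to code understanding and generation can be built on The Vault, such as code summarization, text-to-code generation, and code search.
The natural language text (docstrings) is in English.
The Vault supports 10 programming languages: Python, Java, JavaScript, PHP, C, C#, C++, Go, Ruby, Rust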
{
"hexsha": "5c47f0b4c173a8fd03e4e633d9b3dd8211e67ad0",
"repo": "neumanna94/beepboop",
"path": "js/scripts.js",
"license": [
"MIT"
],
"language": "JavaScript",
"identifier": "beepBoopSelector",
"return_type": "<not_specific>",
"original_string": "function beepBoopSelector(inputString, bbFunction){\n if(bbFunction==1){\n return beepBoop(inputString);\n } else if(bbFunction==2){\n return beepBoop2(inputString);\n } else if(bbFunction==3){\n return beepBoop3(inputString);\n } else {\n }\n}",
"original_docstring": "//Determines what beepBoop function to use",
"docstring": "Determines what beepBoop function to use",
"docstring_tokens": [
"Determines",
"what",
"beepBoop",
"function",
"to",
"use"
],
"code": "function beepBoopSelector(inputString, bbFunction){\n if(bbFunction==1){\n return beepBoop(inputString);\n } else if(bbFunction==2){\n return beepBoop2(inputString);\n } else if(bbFunction==3){\n return beepBoop3(inputString);\n } else {\n }\n}",
"code_tokens": [
"function",
"beepBoopSelector",
"(",
"inputString",
",",
"bbFunction",
")",
"{",
"if",
"(",
"bbFunction",
"==",
"1",
")",
"{",
"return",
"beepBoop",
"(",
"inputString",
")",
";",
"}",
"else",
"if",
"(",
"bbFunction",
"==",
"2",
")",
"{",
"return",
"beepBoop2",
"(",
"inputString",
")",
";",
"}",
"else",
"if",
"(",
"bbFunction",
"==",
"3",
")",
"{",
"return",
"beepBoop3",
"(",
"inputString",
")",
";",
"}",
"else",
"{",
"}",
"}"
],
"short_docstring": "Determines what beepBoop function to use",
"short_docstring_tokens": [
"Determines",
"what",
"beepBoop",
"function",
"to",
"use"
],
"comment": [],
"parameters": [
{
"param": "inputString",
"type": null
},
{
"param": "bbFunction",
"type": null
}
],
"docstring_params": {
"returns": [],
"raises": [],
"params": [
{
"identifier": "inputString",
"type": null,
"docstring": null,
"docstring_tokens": [],
"default": null,
"is_optional": null
},
{
"identifier": "bbFunction",
"type": null,
"docstring": null,
"docstring_tokens": [],
"default": null,
"is_optional": null
}
],
"outlier_params": [],
"others": []
}
}
Data fields at the function level:
For more details and examples, see here.
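As an illustration of how the fields in the sample record map onto the tasks mentioned earlier, the following minimal sketch builds input/target pairs from a single record. The field names (`code`, `docstring`, `parameters`) come from the sample above; the helper functions and the `source`/`target` keys are hypothetical names chosen for this example, and the `code` string is abbreviated.

```python
# A trimmed-down record using the same field names as the sample above.
# The function body in "code" is abbreviated for readability.
record = {
    "language": "JavaScript",
    "identifier": "beepBoopSelector",
    "docstring": "Determines what beepBoop function to use",
    "code": "function beepBoopSelector(inputString, bbFunction){ /* body omitted */ }",
    "parameters": [
        {"param": "inputString", "type": None},
        {"param": "bbFunction", "type": None},
    ],
}


def to_summarization_pair(rec):
    """Code summarization: the code is the input, the docstring is the target."""
    return {"source": rec["code"], "target": rec["docstring"]}


def to_generation_pair(rec):
    """Text-to-code generation: the docstring is the input, the code is the target."""
    return {"source": rec["docstring"], "target": rec["code"]}


def parameter_names(rec):
    """Collect parameter names from the 'parameters' field."""
    return [p["param"] for p in rec["parameters"]]


summary_pair = to_summarization_pair(record)
generation_pair = to_generation_pair(record)
print(summary_pair["target"])   # prints: Determines what beepBoop function to use
print(parameter_names(record))  # prints: ['inputString', 'bbFunction']
```

For code search, the same two fields could serve as query (`docstring`) and document (`code`).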
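In this repository, The Vault is divided into 5 subsets: three training versions that differ in their share of the full training set, plus a validation set and a test set (approximately 20,000 samples per language in each). Per-language statistics for each split are given in the following section.
The dataset was deduplicated before splitting. The training set comes in 3 versions: small (5%), medium (20%), and large (100%).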
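Deduplicating before splitting matters because a function that appears in both the training and test sets would leak test data into training. The sketch below shows one generic way to do this; it is not the authors' pipeline. The exact-match hashing on whitespace-normalized code, the random shuffling, and all function names are assumptions made for illustration.

```python
import hashlib
import random


def deduplicate(samples):
    """Drop samples whose whitespace-normalized code has already been seen."""
    seen = set()
    unique = []
    for sample in samples:
        normalized = " ".join(sample["code"].split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


def split(samples, n_valid, n_test, seed=0):
    """Shuffle once, then carve out validation and test; the rest is training."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test


def subsample(train, fraction, seed=0):
    """Draw a smaller training version (e.g. 0.05 for a 'small' set)."""
    k = max(1, round(len(train) * fraction))
    return random.Random(seed).sample(train, k)


# Toy corpus: 100 distinct functions plus one whitespace-only variant of f0.
corpus = [{"code": f"def f{i}():\n    return {i}"} for i in range(100)]
corpus.append({"code": "def f0():   return 0"})

unique = deduplicate(corpus)
train, valid, test = split(unique, n_valid=10, n_test=10)
small = subsample(train, 0.05)
print(len(unique), len(train), len(valid), len(test), len(small))
```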
| Dataset | #Language | #Code-text pair |
|---|---|---|
| PyMT5 | 1 | ≈ 7,700,000 |
| CoDesc | 1 | 4,211,516 |
| CodeSearchNet | 6 | 2,326,976 |
| CodeSearchNet (CodeXGLUE) | 6 | 1,005,474 |
| Deepcom | 1 | 424,028 |
| CONCODE | 1 | 2,184,310 |
| Funcom | 1 | 2,149,121 |
| CodeT5 | 8 | 3,158,313 |
| The Vault | 10 | 34,098,775 |

| Language | train/small | train/medium | train/full | validation | test | total |
|---|---|---|---|---|---|---|
| Python | 370,657 | 1,952,110 | 7,772,647 | 30,992 | 21,652 | 7,825,291 |
| Java | 351,213 | 1,612,366 | 6,629,193 | 22,677 | 15,552 | 6,667,422 |
| JavaScript | 82,931 | 404,729 | 1,640,416 | 22,044 | 21,108 | 1,683,568 |
| PHP | 236,638 | 1,155,476 | 4,656,371 | 21,375 | 19,010 | 4,696,756 |
| C | 105,978 | 381,207 | 1,639,319 | 27,525 | 19,122 | 1,685,966 |
| C# | 141,090 | 783,166 | 3,305,891 | 24,787 | 19,638 | 3,350,316 |
| C++ | 87,420 | 410,907 | 1,671,268 | 20,011 | 18,169 | 1,709,448 |
| Go | 267,535 | 1,319,547 | 5,109,020 | 19,102 | 25,314 | 5,153,436 |
| Ruby | 23,921 | 112,574 | 424,339 | 17,338 | 19,908 | 461,585 |
| Rust | 35,367 | 224,015 | 825,130 | 16,716 | 23,141 | 864,987 |
| TOTAL | 1,702,750 | 8,356,097 | 33,673,594 | 222,567 | 202,614 | 34,098,775 |
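A quick arithmetic check helps when reading the table: in every row, the total equals train/full + validation + test, so the small and medium training versions are not counted again in the total. The snippet below reproduces that check with the figures copied from the table; the variable names are just for this example.

```python
# (train/full, validation, test, total) per language, copied from the table.
rows = {
    "Python":     (7_772_647, 30_992, 21_652, 7_825_291),
    "Java":       (6_629_193, 22_677, 15_552, 6_667_422),
    "JavaScript": (1_640_416, 22_044, 21_108, 1_683_568),
    "PHP":        (4_656_371, 21_375, 19_010, 4_696_756),
    "C":          (1_639_319, 27_525, 19_122, 1_685_966),
    "C#":         (3_305_891, 24_787, 19_638, 3_350_316),
    "C++":        (1_671_268, 20_011, 18_169, 1_709_448),
    "Go":         (5_109_020, 19_102, 25_314, 5_153_436),
    "Ruby":       (424_339, 17_338, 19_908, 461_585),
    "Rust":       (825_130, 16_716, 23_141, 864_987),
}
reported_total = (33_673_594, 222_567, 202_614, 34_098_775)

# Each row: train/full + validation + test should equal the row total.
for language, (train_full, validation, test, total) in rows.items():
    assert train_full + validation + test == total, language

# Each column should sum to the TOTAL row.
column_sums = tuple(sum(column) for column in zip(*rows.values()))
assert column_sums == reported_total
print(column_sums)
```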
You can load The Vault with the datasets library:
pip install datasets
from datasets import load_dataset
# Load full function level dataset (34M samples)
dataset = load_dataset("Fsoft-AIC/the-vault-function")
# Load function level train/validation/test set
dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"])
# Load "small" (or "medium", "full") version of function level training set
dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train/small"])
# specific language (e.g. Python)
dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"], languages=['Python'])
# dataset streaming
data = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"], streaming=True)
for sample in iter(data['train']):
    print(sample)
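When streaming, it is usually more practical to stop after a fixed number of samples than to print the whole split. The sketch below shows one way to take the first few records that pass a filter from any iterable of samples. `take_matching`, `has_long_docstring`, and `fake_stream` are hypothetical names; `fake_stream` stands in for the streamed split so the example runs offline, and it assumes each sample is a dict with a `docstring_tokens` field as in the record shown earlier.

```python
from itertools import islice


def has_long_docstring(sample, min_tokens=5):
    """Hypothetical filter: keep samples whose docstring has enough tokens."""
    return len(sample["docstring_tokens"]) >= min_tokens


def take_matching(samples, predicate, limit):
    """Lazily scan `samples` and return the first `limit` items that pass."""
    return list(islice(filter(predicate, samples), limit))


def fake_stream():
    """Stand-in for a streamed split; yields dicts shaped like Vault records."""
    yield {"identifier": "a", "docstring_tokens": ["Adds", "two", "numbers"]}
    yield {"identifier": "b", "docstring_tokens": ["Determines", "what", "beepBoop", "function", "to", "use"]}
    yield {"identifier": "c", "docstring_tokens": ["Returns", "the", "parsed", "config", "object"]}
    yield {"identifier": "d", "docstring_tokens": ["Opens", "the", "file", "and", "reads", "it"]}


picked = take_matching(fake_stream(), has_long_docstring, limit=2)
print([s["identifier"] for s in picked])  # prints: ['b', 'c']
```

Because `filter` and `islice` are lazy, the scan stops as soon as `limit` matches are found, so the rest of the stream is never read.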
A backup of the dataset can be downloaded from Azure Blob Storage. See Download The Vault from Azure blob storage.
MIT License
@article{manh2023vault,
title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation},
author={Manh, Dung Nguyen and Hai, Nam Le and Dau, Anh TV and Nguyen, Anh Minh and Nghiem, Khanh and Guo, Jin and Bui, Nghi DQ},
journal={arXiv preprint arXiv:2305.06156},
year={2023}
}
This dataset was developed by the FSOFT AI4Code team.