Dataset:
Fsoft-AIC/the-vault-function
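The Vault is a comprehensive, large-scale, multilingual parallel dataset of high-quality code-text pairs derived from The Stack, the largest permissively licensed source code dataset.
We provide The Vault, which contains code snippets from 10 popular programming languages: Java, JavaScript, Python, Ruby, Rust, Golang, C#, C++, C, and PHP. The dataset offers multiple code-snippet levels, metadata, and 11 docstring styles for better usability and versatility.
The Vault can be used for pretraining LLMs or for downstream code-text interaction tasks. A number of code understanding and generation tasks can be built from it, such as code summarization, text-to-code generation, and code search.
The natural language text (docstrings) is in English.
The Vault supports 10 programming languages: Python, Java, JavaScript, PHP, C, C#, C++, Go, Ruby, Rust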
    {
      "hexsha": "5c47f0b4c173a8fd03e4e633d9b3dd8211e67ad0",
      "repo": "neumanna94/beepboop",
      "path": "js/scripts.js",
      "license": ["MIT"],
      "language": "JavaScript",
      "identifier": "beepBoopSelector",
      "return_type": "<not_specific>",
      "original_string": "function beepBoopSelector(inputString, bbFunction){\n if(bbFunction==1){\n return beepBoop(inputString);\n } else if(bbFunction==2){\n return beepBoop2(inputString);\n } else if(bbFunction==3){\n return beepBoop3(inputString);\n } else {\n }\n}",
      "original_docstring": "//Determines what beepBoop function to use",
      "docstring": "Determines what beepBoop function to use",
      "docstring_tokens": ["Determines", "what", "beepBoop", "function", "to", "use"],
      "code": "function beepBoopSelector(inputString, bbFunction){\n if(bbFunction==1){\n return beepBoop(inputString);\n } else if(bbFunction==2){\n return beepBoop2(inputString);\n } else if(bbFunction==3){\n return beepBoop3(inputString);\n } else {\n }\n}",
      "code_tokens": ["function", "beepBoopSelector", "(", "inputString", ",", "bbFunction", ")", "{", "if", "(", "bbFunction", "==", "1", ")", "{", "return", "beepBoop", "(", "inputString", ")", ";", "}", "else", "if", "(", "bbFunction", "==", "2", ")", "{", "return", "beepBoop2", "(", "inputString", ")", ";", "}", "else", "if", "(", "bbFunction", "==", "3", ")", "{", "return", "beepBoop3", "(", "inputString", ")", ";", "}", "else", "{", "}", "}"],
      "short_docstring": "Determines what beepBoop function to use",
      "short_docstring_tokens": ["Determines", "what", "beepBoop", "function", "to", "use"],
      "comment": [],
      "parameters": [
        {"param": "inputString", "type": null},
        {"param": "bbFunction", "type": null}
      ],
      "docstring_params": {
        "returns": [],
        "raises": [],
        "params": [
          {"identifier": "inputString", "type": null, "docstring": null, "docstring_tokens": [], "default": null, "is_optional": null},
          {"identifier": "bbFunction", "type": null, "docstring": null, "docstring_tokens": [], "default": null, "is_optional": null}
        ],
        "outlier_params": [],
        "others": []
      }
    }
Data fields at the function level:
For more details and examples, see here.
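As a minimal sketch of how a function-level record might be consumed, the snippet below turns one sample into a code-summarization pair and a text-to-code pair. It uses an abbreviated copy of the sample record shown earlier (only `language`, `identifier`, `docstring`, and `code`). The helper names `to_summarization_pair` and `to_generation_pair`, the `input`/`target` keys, and the prompt wording are illustrative assumptions, not part of The Vault.

```python
# Abbreviated copy of the sample record; all other fields are omitted for brevity.
sample = {
    "language": "JavaScript",
    "identifier": "beepBoopSelector",
    "docstring": "Determines what beepBoop function to use",
    "code": (
        "function beepBoopSelector(inputString, bbFunction){\n"
        " if(bbFunction==1){\n"
        " return beepBoop(inputString);\n"
        " } else if(bbFunction==2){\n"
        " return beepBoop2(inputString);\n"
        " } else if(bbFunction==3){\n"
        " return beepBoop3(inputString);\n"
        " } else {\n"
        " }\n"
        "}"
    ),
}


def to_summarization_pair(record):
    """Hypothetical helper: code is the input, the docstring is the target."""
    return {"input": record["code"], "target": record["docstring"]}


def to_generation_pair(record):
    """Hypothetical helper: a language-tagged docstring is the input, code is the target."""
    prompt = f"{record['language']}: {record['docstring']}"
    return {"input": prompt, "target": record["code"]}


summ = to_summarization_pair(sample)
gen = to_generation_pair(sample)
print(gen["input"])
```

A code-search setup could reuse the same two fields, treating `docstring` as the query and `code` as the document to retrieve.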
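In this repository, The Vault is divided into 5 subsets: three training versions, split by their size relative to the full training set, plus a validation set and a test set (approximately 20,000 samples in each). The per-language statistics for each split are shown in the following section.
The dataset was deduplicated before splitting. The training set comes in 3 versions: small (5%), medium (20%), and large (100%, the `train/full` split).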
Dataset | #Language | #Code-text pair |
---|---|---|
PyMT5 | 1 | ≈ 7,700,000 |
CoDesc | 1 | 4,211,516 |
CodeSearchNet | 6 | 2,326,976 |
CodeSearchNet (CodeXGLUE) | 6 | 1,005,474 |
Deepcom | 1 | 424,028 |
CONCODE | 1 | 2,184,310 |
Funcom | 1 | 2,149,121 |
CodeT5 | 8 | 3,158,313 |
The Vault | 10 | 34,098,775 |
Language | train/small | train/medium | train/full | validation | test | total |
---|---|---|---|---|---|---|
Python | 370,657 | 1,952,110 | 7,772,647 | 30,992 | 21,652 | 7,825,291 |
Java | 351,213 | 1,612,366 | 6,629,193 | 22,677 | 15,552 | 6,667,422 |
JavaScript | 82,931 | 404,729 | 1,640,416 | 22,044 | 21,108 | 1,683,568 |
PHP | 236,638 | 1,155,476 | 4,656,371 | 21,375 | 19,010 | 4,696,756 |
C | 105,978 | 381,207 | 1,639,319 | 27,525 | 19,122 | 1,685,966 |
C# | 141,090 | 783,166 | 3,305,891 | 24,787 | 19,638 | 3,350,316 |
C++ | 87,420 | 410,907 | 1,671,268 | 20,011 | 18,169 | 1,709,448 |
Go | 267,535 | 1,319,547 | 5,109,020 | 19,102 | 25,314 | 5,153,436 |
Ruby | 23,921 | 112,574 | 424,339 | 17,338 | 19,908 | 461,585 |
Rust | 35,367 | 224,015 | 825,130 | 16,716 | 23,141 | 864,987 |
TOTAL | 1,702,750 | 8,356,097 | 33,673,594 | 222,567 | 202,614 | 34,098,775 |
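The figures in the split table can be cross-checked with a few lines of arithmetic. The sketch below copies the rows into a dictionary (the variable names are illustrative), then verifies that the columns add up to the TOTAL row and that each language's total equals `train/full + validation + test`. It also computes what share of `train/full` the small and medium training sets represent, going by the TOTAL row alone.

```python
# Rows copied from the split table:
# (train/small, train/medium, train/full, validation, test, total)
stats = {
    "Python":     (370_657, 1_952_110, 7_772_647, 30_992, 21_652, 7_825_291),
    "Java":       (351_213, 1_612_366, 6_629_193, 22_677, 15_552, 6_667_422),
    "JavaScript": (82_931, 404_729, 1_640_416, 22_044, 21_108, 1_683_568),
    "PHP":        (236_638, 1_155_476, 4_656_371, 21_375, 19_010, 4_696_756),
    "C":          (105_978, 381_207, 1_639_319, 27_525, 19_122, 1_685_966),
    "C#":         (141_090, 783_166, 3_305_891, 24_787, 19_638, 3_350_316),
    "C++":        (87_420, 410_907, 1_671_268, 20_011, 18_169, 1_709_448),
    "Go":         (267_535, 1_319_547, 5_109_020, 19_102, 25_314, 5_153_436),
    "Ruby":       (23_921, 112_574, 424_339, 17_338, 19_908, 461_585),
    "Rust":       (35_367, 224_015, 825_130, 16_716, 23_141, 864_987),
}
reported_total = (1_702_750, 8_356_097, 33_673_594, 222_567, 202_614, 34_098_775)

# Sum every column over the ten languages and compare with the TOTAL row.
column_sums = tuple(sum(column) for column in zip(*stats.values()))
totals_match = column_sums == reported_total

# Each language's total should equal train/full + validation + test.
rows_match = all(
    full + val + test == total
    for _small, _medium, full, val, test, total in stats.values()
)

# Share of the full training set covered by the small and medium versions.
small_share = reported_total[0] / reported_total[2]
medium_share = reported_total[1] / reported_total[2]

print(f"columns consistent: {totals_match}, rows consistent: {rows_match}")
print(f"small: {small_share:.1%} of train/full, medium: {medium_share:.1%} of train/full")
```

The shares printed here are derived from the table only, so they are a way to compare the actual subset sizes against the nominal percentages rather than a statement of how the subsets were sampled.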
You can load The Vault dataset with the datasets library: pip install datasets
    from datasets import load_dataset

    # Load full function level dataset (34M samples)
    dataset = load_dataset("Fsoft-AIC/the-vault-function")

    # Load function level train/validation/test set
    dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"])

    # Load "small" (or "medium", "full") version of function level training set
    dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train/small"])

    # specific language (e.g. Python)
    dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"], languages=['Python'])

    # dataset streaming
    data = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"], streaming=True)
    for sample in iter(data['train']):
        print(sample)
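When streaming, it is often enough to look at a handful of samples instead of iterating over the whole split. The sketch below shows one possible way to do that with `itertools.islice`, keeping only samples whose docstring has a minimum number of tokens. The helper `take_documented`, its `min_doc_tokens` threshold, and the stand-in records are all hypothetical and made up for illustration. The only thing assumed about the real data is the `docstring_tokens` field from the sample record.

```python
from itertools import islice


def take_documented(samples, n, min_doc_tokens=3):
    """Return up to n samples whose docstring has at least min_doc_tokens tokens.

    Works on any iterable of dict-like records and stops reading once n are found.
    """
    kept = (
        s for s in samples
        if len(s.get("docstring_tokens") or []) >= min_doc_tokens
    )
    return list(islice(kept, n))


# Made-up stand-in for a streamed split; with the real dataset this
# would be data["train"] from the streaming example.
fake_stream = iter([
    {"identifier": "a", "docstring_tokens": ["Adds", "two", "numbers"]},
    {"identifier": "b", "docstring_tokens": ["TODO"]},
    {"identifier": "c", "docstring_tokens": ["Parses", "a", "config", "file"]},
    {"identifier": "d", "docstring_tokens": ["Closes", "the", "open", "socket"]},
])

first_two = take_documented(fake_stream, n=2)
print([s["identifier"] for s in first_two])
```

Because the helper only relies on plain iteration, the same function can be tried on an in-memory list before pointing it at a streamed split.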
A backup copy of the dataset can be downloaded from Azure storage. See Download The Vault from Azure blob storage.
MIT License
    @article{manh2023vault,
      title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation},
      author={Manh, Dung Nguyen and Hai, Nam Le and Dau, Anh TV and Nguyen, Anh Minh and Nghiem, Khanh and Guo, Jin and Bui, Nghi DQ},
      journal={arXiv preprint arXiv:2305.06156},
      year={2023}
    }
This dataset was developed by the FSOFT AI4Code team.