Dataset:

Fsoft-AIC/the-vault-function

English

The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Dataset Summary

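The Vault is a comprehensive, large-scale, multilingual parallel dataset of high-quality code-text pairs derived from The Stack, the largest permissively licensed source code dataset.

The Vault contains code snippets from 10 popular programming languages: Java, JavaScript, Python, Ruby, Rust, Golang, C#, C++, C, and PHP. The dataset provides multiple code-snippet levels, metadata, and 11 docstring styles for greater usability and versatility.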

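Supported Tasks

The Vault can be used for pretraining LLMs or for downstream code-text interaction tasks. A number of code understanding and generation tasks can be built from it, such as code summarization, text-to-code generation, and code search.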

Languages

The natural language text (docstrings) is in English.

The Vault supports 10 programming languages: Python, Java, JavaScript, PHP, C, C#, C++, Go, Ruby, Rust

Dataset Structure

Data Instances

{

    "hexsha": "5c47f0b4c173a8fd03e4e633d9b3dd8211e67ad0",
    "repo": "neumanna94/beepboop",
    "path": "js/scripts.js",
    "license": [
        "MIT"
    ],
    "language": "JavaScript",
    "identifier": "beepBoopSelector",
    "return_type": "<not_specific>",
    "original_string": "function beepBoopSelector(inputString, bbFunction){\n  if(bbFunction==1){\n    return beepBoop(inputString);\n  } else if(bbFunction==2){\n    return beepBoop2(inputString);\n  } else if(bbFunction==3){\n    return beepBoop3(inputString);\n  } else {\n  }\n}",
    "original_docstring": "//Determines what beepBoop function to use",
    "docstring": "Determines what beepBoop function to use",
    "docstring_tokens": [
        "Determines",
        "what",
        "beepBoop",
        "function",
        "to",
        "use"
    ],
    "code": "function beepBoopSelector(inputString, bbFunction){\n  if(bbFunction==1){\n    return beepBoop(inputString);\n  } else if(bbFunction==2){\n    return beepBoop2(inputString);\n  } else if(bbFunction==3){\n    return beepBoop3(inputString);\n  } else {\n  }\n}",
    "code_tokens": [
        "function",
        "beepBoopSelector",
        "(",
        "inputString",
        ",",
        "bbFunction",
        ")",
        "{",
        "if",
        "(",
        "bbFunction",
        "==",
        "1",
        ")",
        "{",
        "return",
        "beepBoop",
        "(",
        "inputString",
        ")",
        ";",
        "}",
        "else",
        "if",
        "(",
        "bbFunction",
        "==",
        "2",
        ")",
        "{",
        "return",
        "beepBoop2",
        "(",
        "inputString",
        ")",
        ";",
        "}",
        "else",
        "if",
        "(",
        "bbFunction",
        "==",
        "3",
        ")",
        "{",
        "return",
        "beepBoop3",
        "(",
        "inputString",
        ")",
        ";",
        "}",
        "else",
        "{",
        "}",
        "}"
    ],

    "short_docstring": "Determines what beepBoop function to use",
    "short_docstring_tokens": [
        "Determines",
        "what",
        "beepBoop",
        "function",
        "to",
        "use"
    ],
    "comment": [],
    "parameters": [
        {
            "param": "inputString",
            "type": null
        },
        {
            "param": "bbFunction",
            "type": null
        }
    ],
    "docstring_params": {
        "returns": [],
        "raises": [],
        "params": [
            {
                "identifier": "inputString",
                "type": null,
                "docstring": null,
                "docstring_tokens": [],
                "default": null,
                "is_optional": null
            },
            {
                "identifier": "bbFunction",
                "type": null,
                "docstring": null,
                "docstring_tokens": [],
                "default": null,
                "is_optional": null
            }
        ],
        "outlier_params": [],
        "others": []
    }
}
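To make the relationship between the docstring fields concrete, here is a minimal sketch that parses a trimmed copy of the instance above. Two things it checks are observations about this one record only, not documented guarantees: that `short_docstring` equals the first line of `docstring`, and that splitting on whitespace happens to reproduce `docstring_tokens`. The tokenizer the dataset actually uses is not described here.

```python
import json

# Trimmed copy of the instance above; only a few fields are kept.
record_json = """
{
    "identifier": "beepBoopSelector",
    "language": "JavaScript",
    "original_docstring": "//Determines what beepBoop function to use",
    "docstring": "Determines what beepBoop function to use",
    "docstring_tokens": ["Determines", "what", "beepBoop", "function", "to", "use"],
    "short_docstring": "Determines what beepBoop function to use",
    "parameters": [
        {"param": "inputString", "type": null},
        {"param": "bbFunction", "type": null}
    ]
}
"""

record = json.loads(record_json)

# JSON null is parsed as Python None, so untyped parameters have type None.
param_names = [p["param"] for p in record["parameters"]]
untyped = [p["param"] for p in record["parameters"] if p["type"] is None]

# Observation for this record: the first docstring line is the short docstring.
first_line = record["docstring"].splitlines()[0]

# Observation for this record: whitespace splitting matches docstring_tokens.
whitespace_tokens = record["docstring"].split()

print(param_names, untyped)
print(first_line == record["short_docstring"])
print(whitespace_tokens == record["docstring_tokens"])
```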

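Data Fields

Data fields for function level:

  • hexsha (string): the unique git hash of the file
  • repo (string): the owner/repo
  • path (string): the full path to the original file
  • license (list): licenses in the repo
  • language (string): the programming language
  • identifier (string): the function or method name
  • return_type (string): the type returned by the function
  • original_string (string): original version of the function/class node
  • original_docstring (string): the raw string before tokenization or parsing
  • code (string): the part of the original that is code
  • code_tokens (list): tokenized version of code
  • short_docstring (string): short summary (first line of the docstring)
  • short_docstring_tokens (list): tokenized version of short_docstring
  • docstring (string): the top-level comment or docstring (the docstring with parameter docs, return, exception fields, etc. removed)
  • docstring_tokens (list): tokenized version of docstring
  • comment (list): list of comments (lines) inside the function/class
  • parameters (list): list of parameters and their types (type can be None)
  • docstring_params (dict): dictionary of the information parsed from the docstring

See here for more details and examples.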
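As a rough illustration of how these fields might feed tasks such as code summarization and text-to-code generation, the sketch below builds input/target pairs from a record. The helper names (`to_summarization_pair`, `to_generation_pair`, `undocumented_params`) and the hand-written sample are hypothetical and are not part of the dataset or its tooling; only the field names are taken from the data instance.

```python
from typing import Dict, List, Tuple


def to_summarization_pair(sample: Dict) -> Tuple[str, str]:
    """Code summarization: the code is the input, the docstring the target."""
    return sample["code"], sample["docstring"]


def to_generation_pair(sample: Dict) -> Tuple[str, str]:
    """Text-to-code generation: the docstring is the prompt, the code the target."""
    return sample["docstring"], sample["code"]


def undocumented_params(sample: Dict) -> List[str]:
    """Names of parameters whose parsed docstring entry is None."""
    return [
        p["identifier"]
        for p in sample["docstring_params"]["params"]
        if p["docstring"] is None
    ]


# Hand-written illustrative sample (not taken from the dataset).
sample = {
    "code": "function add(a, b){ return a + b; }",
    "docstring": "Adds two numbers",
    "docstring_params": {
        "params": [
            {"identifier": "a", "docstring": "first addend"},
            {"identifier": "b", "docstring": None},
        ]
    },
}

source, target = to_summarization_pair(sample)
prompt, completion = to_generation_pair(sample)
missing = undocumented_params(sample)

print(source, "->", target)
print(missing)
```

Keeping the pair builders as small pure functions means they can be applied with a plain loop or mapped over a loaded split without changes.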

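Data Splits

In this repository, The Vault is divided into 5 subsets: three training versions of different sizes drawn from the full training set, plus a validation set and a test set (roughly 20,000 samples per language in each). Per-language statistics for each split are given in the following section.

The dataset was deduplicated before splitting. The training set comes in three versions: small (5%), medium (20%), and full (100%).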

Dataset Statistics

  • Comparison with other benchmarks

| Dataset | #Language | #Code-text pair |
|---|---|---|
| PyMT5 | 1 | ≈ 7,700,000 |
| CoDesc | 1 | 4,211,516 |
| CodeSearchNet | 6 | 2,326,976 |
| CodeSearchNet (CodeXGLUE) | 6 | 1,005,474 |
| Deepcom | 1 | 424,028 |
| CONCODE | 1 | 2,184,310 |
| Funcom | 1 | 2,149,121 |
| CodeT5 | 8 | 3,158,313 |
| The Vault | 10 | 34,098,775 |

  • Statistics per split

| Language | train/small | train/medium | train/full | validation | test | total |
|---|---|---|---|---|---|---|
| Python | 370,657 | 1,952,110 | 7,772,647 | 30,992 | 21,652 | 7,825,291 |
| Java | 351,213 | 1,612,366 | 6,629,193 | 22,677 | 15,552 | 6,667,422 |
| JavaScript | 82,931 | 404,729 | 1,640,416 | 22,044 | 21,108 | 1,683,568 |
| PHP | 236,638 | 1,155,476 | 4,656,371 | 21,375 | 19,010 | 4,696,756 |
| C | 105,978 | 381,207 | 1,639,319 | 27,525 | 19,122 | 1,685,966 |
| C# | 141,090 | 783,166 | 3,305,891 | 24,787 | 19,638 | 3,350,316 |
| C++ | 87,420 | 410,907 | 1,671,268 | 20,011 | 18,169 | 1,709,448 |
| Go | 267,535 | 1,319,547 | 5,109,020 | 19,102 | 25,314 | 5,153,436 |
| Ruby | 23,921 | 112,574 | 424,339 | 17,338 | 19,908 | 461,585 |
| Rust | 35,367 | 224,015 | 825,130 | 16,716 | 23,141 | 864,987 |
| TOTAL | 1,702,750 | 8,356,097 | 33,673,594 | 222,567 | 202,614 | 34,098,775 |
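
The totals can be cross-checked with a few lines of arithmetic. This is only a sketch over the numbers in the TOTAL row; the dictionary keys mirror the column names and nothing here touches the dataset itself.

```python
# Totals copied from the TOTAL row of the split statistics.
totals = {
    "train/small": 1_702_750,
    "train/medium": 8_356_097,
    "train/full": 33_673_594,
    "validation": 222_567,
    "test": 202_614,
}

# train/full, validation and test add up to the "total" column.
grand_total = totals["train/full"] + totals["validation"] + totals["test"]

# Share of the full training set covered by each smaller training version.
small_share = totals["train/small"] / totals["train/full"]
medium_share = totals["train/medium"] / totals["train/full"]

print(f"total pairs:  {grand_total:,}")
print(f"train/small:  {small_share:.1%} of train/full")
print(f"train/medium: {medium_share:.1%} of train/full")
```

With these totals, the small version works out to about 5% of train/full and the medium version to about 25%, so the nominal percentages of the training versions are best read as approximate.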

Usage

You can load The Vault with the datasets library, installed via pip install datasets:

from datasets import load_dataset

# Load full function level dataset (34M samples)
dataset = load_dataset("Fsoft-AIC/the-vault-function")

# Load function level train/validation/test set
dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"])

# Load "small" (or "medium", "full") version of function level training set
dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train/small"])

# Load a specific language (e.g. Python)
dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"], languages=["Python"])

# Stream the dataset instead of downloading it in full
data = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"], streaming=True)
for sample in data["train"]:
    print(sample)
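
A streamed split is very large, so in practice one usually takes a bounded, filtered slice rather than printing everything. The following is a minimal sketch under stated assumptions: `take_documented` and its `min_tokens` threshold are hypothetical helpers, not part of the datasets library or The Vault, and a small in-memory list stands in for the streamed split so the example runs offline.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


def take_documented(samples: Iterable[Dict], limit: int, min_tokens: int = 3) -> Iterator[Dict]:
    """Return an iterator over at most `limit` samples whose docstring
    has at least `min_tokens` tokens."""
    kept = (s for s in samples if len(s["docstring_tokens"]) >= min_tokens)
    return islice(kept, limit)


# Stand-in for a streamed split; with the real dataset this would be data["train"].
fake_stream = [
    {"identifier": "a", "docstring_tokens": ["Adds", "two", "numbers"]},
    {"identifier": "b", "docstring_tokens": ["TODO"]},
    {"identifier": "c", "docstring_tokens": ["Parses", "the", "config", "file"]},
    {"identifier": "d", "docstring_tokens": ["Closes", "the", "socket"]},
]

selected = [s["identifier"] for s in take_documented(fake_stream, limit=2)]
print(selected)
```

Because both the generator expression and `islice` are lazy, iteration stops as soon as `limit` matching samples have been produced, which matters when the source is a stream.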

A backup of the dataset can be downloaded from Azure blob storage. See Download The Vault from Azure blob storage.

Additional Information

Licensing Information

MIT License

Citation Information

@article{manh2023vault,
  title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation},
  author={Manh, Dung Nguyen and Hai, Nam Le and Dau, Anh TV and Nguyen, Anh Minh and Nghiem, Khanh and Guo, Jin and Bui, Nghi DQ},
  journal={arXiv preprint arXiv:2305.06156},
  year={2023}
}

Contributions

This dataset is developed by the FSOFT AI4Code team.