XLCoST for text-to-code synthesis

Dataset Description

This dataset contains records from the XLCoST benchmark for text-to-code generation. It covers 7 programming languages: Python, C, C#, C++, Java, Javascript, and PHP.

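Languages

The dataset contains English text and the corresponding code. Each program is divided into several code snippets, so the snippet-level subsets contain these snippets together with their comments, while the program-level subsets concatenate all comments into one long description. In addition, programs in all languages are aligned at the snippet level, and the comment for a given snippet is the same across all languages.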

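Dataset Structure

To load the dataset, you need to specify a subset, which is one of the 14 available configurations: LANGUAGE-snippet-level or LANGUAGE-program-level, where LANGUAGE is one of [Python, C, Csharp, C++, Java, Javascript, PHP]. By default, the Python-snippet-level subset is loaded.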
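The 14 configuration names follow directly from the naming rule above (7 languages times 2 levels). A minimal sketch that enumerates them, assuming the names are formed exactly as LANGUAGE-LEVEL with the spellings listed in this card; the helper name config_names is hypothetical and not part of the dataset:

```python
from itertools import product

# Spellings taken from the list in this card (note "Csharp", not "C#").
LANGUAGES = ["Python", "C", "Csharp", "C++", "Java", "Javascript", "PHP"]
LEVELS = ["snippet-level", "program-level"]


def config_names():
    """Return every LANGUAGE-LEVEL subset name (hypothetical helper)."""
    return [f"{lang}-{level}" for lang, level in product(LANGUAGES, LEVELS)]


if __name__ == "__main__":
    for name in config_names():
        print(name)
```

Any of these strings can then be passed as the second argument to load_dataset.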

from datasets import load_dataset
data = load_dataset("codeparrot/xlcost-text-to-code", "Python-program-level")
print(data)

DatasetDict({
    train: Dataset({
        features: ['text', 'code'],
        num_rows: 9263
    })
    test: Dataset({
        features: ['text', 'code'],
        num_rows: 887
    })
    validation: Dataset({
        features: ['text', 'code'],
        num_rows: 472
    })
})

next(iter(data["train"]))
{'text': 'Maximum Prefix Sum possible by merging two given arrays | Python3 implementation of the above approach ; Stores the maximum prefix sum of the array A [ ] ; Traverse the array A [ ] ; Stores the maximum prefix sum of the array B [ ] ; Traverse the array B [ ] ; Driver code',
 'code': 'def maxPresum ( a , b ) : NEW_LINE INDENT X = max ( a [ 0 ] , 0 ) NEW_LINE for i in range ( 1 , len ( a ) ) : NEW_LINE INDENT a [ i ] += a [ i - 1 ] NEW_LINE X = max ( X , a [ i ] ) NEW_LINE DEDENT Y = max ( b [ 0 ] , 0 ) NEW_LINE for i in range ( 1 , len ( b ) ) : NEW_LINE INDENT b [ i ] += b [ i - 1 ] NEW_LINE Y = max ( Y , b [ i ] ) NEW_LINE DEDENT return X + Y NEW_LINE DEDENT A = [ 2 , - 1 , 4 , - 5 ] NEW_LINE B = [ 4 , - 3 , 12 , 4 , - 3 ] NEW_LINE print ( maxPresum ( A , B ) ) NEW_LINE'}

Note that the data has been tokenized, so extra spaces appear, and NEW_LINE is used in place of \n, INDENT in place of \t, DEDENT to cancel an indentation level, and so on.
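To read or run a Python sample, the markers described above have to be turned back into line breaks and indentation. The following is a minimal sketch, not an official tool of the dataset: the function name detokenize_python is hypothetical, it assumes the three markers never occur inside string literals, and it leaves the extra spaces between tokens untouched (Python accepts them).

```python
def detokenize_python(code, indent="    "):
    """Rebuild Python source from NEW_LINE / INDENT / DEDENT markers.

    Assumption: the markers are standalone whitespace-separated tokens
    and do not appear inside string literals.
    """
    lines = []      # finished source lines
    current = []    # tokens of the line being built
    level = 0       # current indentation depth
    for token in code.split():
        if token == "NEW_LINE":
            lines.append(indent * level + " ".join(current))
            current = []
        elif token == "INDENT":
            level += 1
        elif token == "DEDENT":
            level = max(level - 1, 0)
        else:
            current.append(token)
    if current:  # flush a final line that has no trailing NEW_LINE
        lines.append(indent * level + " ".join(current))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = (
        "def add ( a , b ) : NEW_LINE INDENT return a + b NEW_LINE "
        "DEDENT print ( add ( 1 , 2 ) ) NEW_LINE"
    )
    print(detokenize_python(sample))
```

Because INDENT and DEDENT always follow the NEW_LINE of the previous line, the depth is already up to date when the next line is flushed. Samples in the other languages use the same NEW_LINE marker but would need language-specific handling that this sketch does not attempt.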

Data Fields

text: natural-language description/comment
code: the code, either a snippet or a whole program
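In the program-level record shown earlier, the text field appears to consist of a problem title, then " | ", then the per-snippet comments joined with " ; ". A minimal sketch that splits such a string back into its parts; both separators are assumptions inferred from that single example rather than a documented format, and split_description is a hypothetical helper:

```python
def split_description(text):
    """Split a program-level description into (title, comments).

    Assumption: the title is separated by " | " and the comments by " ; ",
    as in the one example record shown in this card.
    """
    title, sep, rest = text.partition(" | ")
    if not sep:
        # No title separator found: treat the whole string as comments.
        title, rest = None, text
    comments = [part.strip() for part in rest.split(" ; ") if part.strip()]
    return (title.strip() if title is not None else None), comments


if __name__ == "__main__":
    example = (
        "Maximum Prefix Sum possible by merging two given arrays | "
        "Python3 implementation of the above approach ; "
        "Stores the maximum prefix sum of the array A [ ] ; Driver code"
    )
    title, comments = split_description(example)
    print(title)
    for comment in comments:
        print("-", comment)
```

Comments that themselves contain " ; " or " | " would be split incorrectly, so the result should be treated as approximate.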

Data Splits

Each subset has three splits: train, test, and validation.

Citation Information

@misc{zhu2022xlcost,
     title = {XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence},
     url = {https://arxiv.org/abs/2206.08474},
     author = {Zhu, Ming and Jain, Aneesh and Suresh, Karthik and Ravindran, Roshan and Tipirneni, Sindhu and Reddy, Chandan K.},
     year = {2022},
     eprint={2206.08474},
     archivePrefix={arXiv}
}

Author:

codeparrot

Dataset size:

131.06 MB