Dataset:

NTU-NLP-sg/xCodeEval

Tasks:

Translation

Token Classification

Text2Text Generation

Languages:

code

Multilinguality:

multilingual

Size:

1M<n<10M 10M<n<100M

Language creators:

found expert-generated

Annotations creators:

expert-generated

Source datasets:

original

ArXiv:

arxiv:2303.03004

Other:

programming-language code program-synthesis

License:

cc-by-nc-4.0

xCodeEval

xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval

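We introduce xCodeEval, the largest executable multilingual multitask benchmark to date. It consists of 25M document-level coding examples drawn from about 7.5K unique problems, covers up to 17 programming languages, and offers execution-level parallelism. It features seven tasks spanning code understanding, generation, translation, and retrieval, and it uses execution-based evaluation. We developed ExecEval, a test-case-based multilingual code execution engine that supports every programming language in xCodeEval. We also propose a data splitting and data selection schema, based on the geometric mean and graph-theoretic principles, for balancing data distributions over multiple attributes.

This repository contains the sample code and data links for the xCodeEval paper.

Data Download

This repository currently supports the huggingface load_dataset() API. Follow the example below to load the dataset for an individual task.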

import datasets

# Each call loads one task configuration of the benchmark.
prog_synthesis_dataset = datasets.load_dataset("NTU-NLP-sg/xCodeEval", "program_synthesis")
code_translation_dataset = datasets.load_dataset("NTU-NLP-sg/xCodeEval", "code_translation")
tag_classification_dataset = datasets.load_dataset("NTU-NLP-sg/xCodeEval", "tag_classification")
apr_dataset = datasets.load_dataset("NTU-NLP-sg/xCodeEval", "apr")
code_compilation_dataset = datasets.load_dataset("NTU-NLP-sg/xCodeEval", "code_compilation")
retrieval_code_code_dataset = datasets.load_dataset("NTU-NLP-sg/xCodeEval", "retrieval_code_code")
retrieval_nl_code_dataset = datasets.load_dataset("NTU-NLP-sg/xCodeEval", "retrieval_nl_code")
retrieval_corpus_dataset = datasets.load_dataset("NTU-NLP-sg/xCodeEval", "retrieval_corpus")

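Hugging Face large data download tricks

If you run into long delays during data processing, add ignore_verifications=True. (Recent releases of the datasets library deprecate this argument in favor of verification_mode="no_checks".)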

prog_synthesis_dataset = datasets.load_dataset("NTU-NLP-sg/xCodeEval", "program_synthesis", ignore_verifications=True)

If you run into long delays while downloading the data, use the huggingface streaming mode.

prog_synthesis_dataset = datasets.load_dataset("NTU-NLP-sg/xCodeEval", "program_synthesis", streaming=True)

Just give me the raw data

The data can also be downloaded from the huggingface git LFS repository.

You can download the full data with the following commands.

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/NTU-NLP-sg/xCodeEval
cd xCodeEval
git lfs pull

To download a specific part of the dataset:

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/NTU-NLP-sg/xCodeEval
cd xCodeEval
git lfs pull --include "apr/test/*"
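
Once the files are pulled, a task directory can be read without the datasets library. The following is a minimal sketch, assuming the task files are JSON Lines files with a .jsonl extension (as problem_descriptions.jsonl is); the helper name iter_jsonl_records and the path in the usage note are illustrative and not taken from the repository.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl_records(root: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of every *.jsonl file under `root`.

    Assumption: the pulled task files are JSON Lines; adjust the glob
    pattern if the directory you pulled uses a different extension.
    """
    for path in sorted(Path(root).rglob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines between records
                    yield json.loads(line)
```

For example, `for record in iter_jsonl_records("xCodeEval/apr/test"): ...` would walk the directory pulled by the command above, one record at a time, without loading everything into memory.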

We propose 7 tasks.

Tag Classification

Code Compilation

Program Synthesis

Code Translation

Automatic Program Repair

Code-Code Retrieval

NL-Code Retrieval

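Common Data for Different Tasks

If you are not using the huggingface load_dataset() API, you may need to link some data to the different tasks yourself.

Two data files are shared by multiple tasks.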

problem_descriptions.jsonl

unittest_db.json

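You can find both files in the root directory of the main branch of the huggingface dataset repository. To avoid data redundancy, we did not include this data with each relevant task; instead, every task sample carries a unique ID, src_uid, that is used to retrieve it.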
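
The src_uid lookup described above can be sketched as follows. This is a minimal illustration, assuming each task sample is a dict with a src_uid field; the function names and the "problem" key are hypothetical choices for this sketch, not part of the dataset.

```python
import json
from typing import Dict


def load_problem_index(path: str) -> Dict[str, dict]:
    """Map each src_uid to its record from problem_descriptions.jsonl."""
    index: Dict[str, dict] = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                record = json.loads(line)
                index[record["src_uid"]] = record
    return index


def attach_problem(sample: dict, problem_index: Dict[str, dict]) -> dict:
    """Return a copy of a task sample with its problem description attached.

    The "problem" key is a hypothetical name; it is None when the
    src_uid is not present in the index.
    """
    merged = dict(sample)
    merged["problem"] = problem_index.get(sample["src_uid"])
    return merged
```

Building the index once and reusing it keeps each lookup constant-time, which matters when joining millions of task samples against a few thousand problems.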

Structure of problem_descriptions.jsonl

An example:

{
    "description": "There are $$$n$$$ positive integers $$$a_1, a_2, \\dots, a_n$$$. For the one move you can choose any even value $$$c$$$ and divide by two all elements that equal $$$c$$$.For example, if $$$a=[6,8,12,6,3,12]$$$ and you choose $$$c=6$$$, and $$$a$$$ is transformed into $$$a=[3,8,12,3,3,12]$$$ after the move.You need to find the minimal number of moves for transforming $$$a$$$ to an array of only odd integers (each element shouldn't be divisible by $$$2$$$).",
    "input_from": "standard input",
    "output_to": "standard output",
    "time_limit": "3 seconds",
    "memory_limit": "256 megabytes",
    "input_spec": "The first line of the input contains one integer $$$t$$$ ($$$1 \\le t \\le 10^4$$$) \u2014 the number of test cases in the input. Then $$$t$$$ test cases follow. The first line of a test case contains $$$n$$$ ($$$1 \\le n \\le 2\\cdot10^5$$$) \u2014 the number of integers in the sequence $$$a$$$. The second line contains positive integers $$$a_1, a_2, \\dots, a_n$$$ ($$$1 \\le a_i \\le 10^9$$$). The sum of $$$n$$$ for all test cases in the input doesn't exceed $$$2\\cdot10^5$$$.",
    "output_spec": "For $$$t$$$ test cases print the answers in the order of test cases in the input. The answer for the test case is the minimal number of moves needed to make all numbers in the test case odd (i.e. not divisible by $$$2$$$).",
    "notes": "NoteIn the first test case of the example, the optimal sequence of moves can be as follows:  before making moves $$$a=[40, 6, 40, 3, 20, 1]$$$;  choose $$$c=6$$$;  now $$$a=[40, 3, 40, 3, 20, 1]$$$;  choose $$$c=40$$$;  now $$$a=[20, 3, 20, 3, 20, 1]$$$;  choose $$$c=20$$$;  now $$$a=[10, 3, 10, 3, 10, 1]$$$;  choose $$$c=10$$$;  now $$$a=[5, 3, 5, 3, 5, 1]$$$ \u2014 all numbers are odd. Thus, all numbers became odd after $$$4$$$ moves. In $$$3$$$ or fewer moves, you cannot make them all odd.",
    "sample_inputs": [
        "4\n6\n40 6 40 3 20 1\n1\n1024\n4\n2 4 8 16\n3\n3 1 7"
    ],
    "sample_outputs": [
        "4\n10\n4\n0"
    ],
    "tags": [
        "number theory",
        "greedy"
    ],
    "src_uid": "afcd41492158e68095b01ff1e88c3dd4",
    "difficulty": 1200,
    "created_at": 1576321500
}

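Key Definitions

description: The problem description in text format; mathematical expressions are written in LaTeX.

input_from: How the program should receive the unit test input.

output_to: Where the program should write the unit test result.

time_limit: The time limit for solving the problem.

memory_limit: The memory limit for solving the problem.

input_spec: How and in what order the input is given to the program. It also includes the data ranges, types, and sizes.

output_spec: How the output should be printed. In most cases the unit test results are checked by exact string match or by floating-point comparison with a precision bound.

sample_inputs: Sample inputs for code that is expected to solve the problem described in description.

sample_outputs: The expected outputs for sample_inputs from code that is expected to solve the problem described in description.

notes: An explanation of sample_inputs and sample_outputs.

tags: The categories of the problem.

src_uid: The unique ID of the problem. Task data samples reference this ID instead of repeating all of this information.

difficulty: How difficult the problem is for a human to solve (annotated by a human expert).

created_at: The Unix timestamp at which the problem was released. Use the datetime library in Python to parse it into a human-readable format.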
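
As a small illustration of the fields above, the sketch below converts created_at to a readable UTC timestamp and splits limit strings such as "3 seconds" into a number and a unit. The helper names are hypothetical, and the "value unit" layout of time_limit and memory_limit is assumed from the example record.

```python
from datetime import datetime, timezone
from typing import Tuple


def created_at_to_iso(created_at: int) -> str:
    """Convert a Unix timestamp to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(created_at, tz=timezone.utc).isoformat()


def parse_limit(text: str) -> Tuple[float, str]:
    """Split a limit such as "3 seconds" into (3.0, "seconds").

    Assumes the "<number> <unit>" layout seen in the example record.
    """
    value, unit = text.split(maxsplit=1)
    return float(value), unit


print(created_at_to_iso(1576321500))  # 2019-12-14T11:05:00+00:00
print(parse_limit("256 megabytes"))   # (256.0, 'megabytes')
```

Passing an explicit timezone avoids results that depend on the local time zone of the machine running the code.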

Structure of unittest_db.json

The structure of the JSON file:

unittest_db = {
    "db884d679d9cfb1dc4bc511f83beedda": [
        {
            "input": "4\r\n3 2 3 2\r\n",
            "output": [
                "1"
            ]
        },
        {
            ...
        },
        ...
    ],
    "3bc096d8cd3418948d5be6bf297aa9b5": [
        ...
    ],
    ...
}

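Key Definitions

The dictionary keys of unittest_db.json (for example, db884d679d9cfb1dc4bc511f83beedda) are the src_uid values from problem_descriptions.jsonl.

input: The input of the unit test.

output: The list of expected outputs for the unit test.

Citation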
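
To show how the unit test records might be consumed, here is a minimal sketch of a checker. It is not ExecEval's comparison logic: it assumes that every entry in the output list is an acceptable answer, and that outputs are compared token by token with an exact string match first and a float comparison within a tolerance as a fallback. The names outputs_match and passes_all, and the solve callable standing in for an executed program, are hypothetical.

```python
import math
from typing import Callable, List


def outputs_match(actual: str, expected: str, tol: float = 1e-6) -> bool:
    """Compare whitespace-separated tokens: exact match, else floats within `tol`."""
    actual_tokens, expected_tokens = actual.split(), expected.split()
    if len(actual_tokens) != len(expected_tokens):
        return False
    for got, want in zip(actual_tokens, expected_tokens):
        if got == want:
            continue
        try:
            if not math.isclose(float(got), float(want), rel_tol=tol, abs_tol=tol):
                return False
        except ValueError:  # a non-numeric token that differs is a mismatch
            return False
    return True


def passes_all(solve: Callable[[str], str], tests: List[dict]) -> bool:
    """Run `solve` on every test input and check it against the expected outputs.

    Assumption: a test passes when the result matches any entry of "output".
    """
    for test in tests:
        actual = solve(test["input"])
        if not any(outputs_match(actual, want) for want in test["output"]):
            return False
    return True
```

With the record shown above, `passes_all(solve, unittest_db["db884d679d9cfb1dc4bc511f83beedda"])` would return True only if solve produces "1" for the first input and a matching answer for every remaining test. Splitting on whitespace also makes the comparison insensitive to the \r\n line endings in the stored inputs and outputs.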

@misc{khan2023xcodeeval,
      title={xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval}, 
      author={Mohammad Abdullah Matin Khan and M Saiful Bari and Xuan Long Do and Weishi Wang and Md Rizwan Parvez and Shafiq Joty},
      year={2023},
      eprint={2303.03004},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Author:

NTU-NLP-sg

Dataset size:

47.71 GB