CodeParrot-Multi 🦜 (small)

CodeParrot-Multi 🦜 is a GPT-2 model (110M parameters) trained to generate code in 9 programming languages: "Java", "JavaScript", "PHP", "Python", "C#", "C++", "GO", "Ruby" and "TypeScript".

Usage

You can load the CodeParrot-Multi model and tokenizer directly in transformers:

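from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot-small-multi")
# AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead for GPT-2 style models
model = AutoModelForCausalLM.from_pretrained("codeparrot/codeparrot-small-multi")

# Forward pass: returns next-token logits for every position in the prompt
inputs = tokenizer("def hello_world():", return_tensors="pt")
outputs = model(**inputs)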
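
The forward pass above returns logits rather than text. As a minimal sketch, a completion could be produced with `generate` as shown below. The helper name `complete_code` and the sampling values (`max_new_tokens=64`, `temperature=0.2`, `top_p=0.95`) are illustrative assumptions, not settings recommended by the model authors.

```python
def complete_code(prompt, max_new_tokens=64,
                  checkpoint="codeparrot/codeparrot-small-multi"):
    """Return the prompt followed by a sampled completion (hypothetical helper)."""
    # Imported lazily so that defining the helper does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.2,  # assumed value: low temperature for conservative samples
        top_p=0.95,       # assumed value: nucleus sampling cutoff
        # Reuse the end-of-sequence token for padding in case no pad token is defined.
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decode the single returned sequence (prompt plus completion) to text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `complete_code("def hello_world():")` downloads the checkpoint on first use and returns the prompt with the generated continuation appended. Because sampling is enabled, the result differs between runs.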

Or with a pipeline:

from transformers import pipeline

pipe = pipeline("text-generation", model="codeparrot/codeparrot-small-multi")
outputs = pipe("def hello_world():")

Training

The model was trained on the Github code small dataset, a near-deduplicated subset of the Github code dataset, with the following settings:

Config	Value
Batch size	192
Context size	1024
Training steps	300,000
Gradient accumulation	2
Gradient checkpointing	False
Learning rate	5e-4
Weight decay	0.1
Warmup steps	2000
Schedule	Cosine
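
To make the learning-rate rows concrete, here is a minimal sketch of a cosine schedule with warmup built from the table values. Linear warmup from zero and decay to zero at the final step are assumptions, since the source does not state the exact shape. The function name `learning_rate` is illustrative.

```python
import math

PEAK_LR = 5e-4         # "Learning rate" from the table
WARMUP_STEPS = 2_000   # "Warmup steps" from the table
TOTAL_STEPS = 300_000  # "Training steps" from the table


def learning_rate(step):
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        # Assumption: linear warmup from 0 to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Assumption: cosine decay from the peak to 0 over the remaining steps.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))


for step in (0, 1_000, 2_000, 151_000, 300_000):
    print(step, learning_rate(step))
```

Under these assumptions the rate climbs to 5e-4 at step 2,000, falls to half of that midway through the decay phase, and reaches zero at step 300,000.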

Training ran on 16 A100 (40GB) GPUs. This setup amounts to roughly 58 billion tokens.
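
The token figure can be checked with a quick calculation from the table. This sketch assumes that the listed batch size of 192 is the effective number of sequences per optimizer step and that every sequence fills the full context. How that batch size splits across GPUs and gradient accumulation is not stated in the source.

```python
BATCH_SIZE = 192          # sequences per optimizer step (assumed to be the effective size)
CONTEXT_SIZE = 1024       # tokens per sequence
TRAINING_STEPS = 300_000  # optimizer steps

tokens_per_step = BATCH_SIZE * CONTEXT_SIZE
total_tokens = tokens_per_step * TRAINING_STEPS

print(f"{tokens_per_step:,} tokens per step")
print(f"{total_tokens / 1e9:.1f} billion tokens in total")
```

The product comes to just under 59 billion tokens, in line with the rounded figure quoted above.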

Performance

We evaluated the model on OpenAI's HumanEval benchmark, which consists of programming challenges:

Metric	Value
pass@1	--%
pass@10	--%
pass@100	--%

The pass@k metric gives the probability that at least one out of k generated samples passes the tests.
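
As an illustration, pass@k is commonly estimated with the unbiased estimator from the HumanEval paper: draw n ≥ k samples per problem, count the c samples that pass, and compute 1 − C(n−c, k) / C(n, k). The sketch below assumes that estimator. The source does not say how the scores for this model were computed, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a passing one.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=1, k=1))
print(pass_at_k(n=10, c=1, k=10))
```

A benchmark score would then be the mean of this per-problem estimate over all problems.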

Resources

Code: repository

Author:

CodeParrot

Dataset size:

236.54 MB