CodeParrot 🦜 (small)

CodeParrot 🦜 is a GPT-2 model (110M parameters) trained to generate Python code.

Usage

You can load the CodeParrot model and tokenizer directly in transformers:

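from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model weights from the Hugging Face Hub.
# AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead.
tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot-small")
model = AutoModelForCausalLM.from_pretrained("codeparrot/codeparrot-small")

# Tokenize a prompt and run one forward pass (this returns logits, not generated text).
inputs = tokenizer("def hello_world():", return_tensors="pt")
outputs = model(**inputs)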

Or with a pipeline:

from transformers import pipeline

pipe = pipeline("text-generation", model="codeparrot/codeparrot-small")
outputs = pipe("def hello_world():")

Training

The model was trained on the cleaned CodeParrot 🦜 dataset with the following settings:

Config	Value
Batch size	192
Context size	1024
Training steps	150,000
Gradient accumulation	1
Gradient checkpointing	False
Learning rate	5e-4
Weight decay	0.1
Warmup steps	2000
Schedule	Cosine
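
The warmup and schedule rows can be read as a learning-rate curve. The sketch below is one common interpretation, not the training script itself: it assumes a linear warmup over the first 2000 steps followed by a cosine decay to zero at step 150,000. The function name `learning_rate` and the decay-to-zero endpoint are assumptions made for illustration.

```python
import math

PEAK_LR = 5e-4          # "Learning rate" from the table
WARMUP_STEPS = 2000     # "Warmup steps" from the table
TOTAL_STEPS = 150_000   # "Training steps" from the table


def learning_rate(step: int) -> float:
    """Hypothetical schedule: linear warmup, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        # Assumed linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    # Cosine decay from PEAK_LR (progress = 0) to 0 (progress = 1).
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 1000, 2000, 76_000, 150_000):
    print(step, learning_rate(step))
```

Under these assumptions the rate peaks at step 2000 and falls to half the peak midway through the decay phase (step 76,000).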

Training was run on 16 A100 (40GB) GPUs. These settings correspond to roughly 29 billion tokens.
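
The token count follows from the table values. A quick check, assuming the batch size of 192 is the global number of sequences per optimizer step and every sequence fills the full 1024-token context:

```python
# Values taken from the training settings table.
batch_size = 192             # assumed to be the global batch (sequences per step)
context_size = 1024          # tokens per sequence
training_steps = 150_000
gradient_accumulation = 1

# Tokens consumed per optimizer step, then over the whole run.
tokens_per_step = batch_size * context_size * gradient_accumulation
total_tokens = tokens_per_step * training_steps

print(f"{total_tokens:,} tokens (~{total_tokens / 1e9:.1f} billion)")
```

This gives about 29.5 billion tokens, consistent with the rounded figure quoted above.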

Performance

We evaluated the model on OpenAI's HumanEval benchmark, which consists of programming challenges:

Metric	Value
pass@1	3.80%
pass@10	6.57%
pass@100	12.78%

The pass@k metric gives the probability that at least one out of k generated samples passes the tests.
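
This card does not say how pass@k was computed. A widely used approach is the unbiased estimator introduced with HumanEval: generate n samples per problem, count the c that pass, and estimate pass@k as 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator; the function name `pass_at_k` and the example counts are illustrative and are not taken from the evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed (assumed estimator)."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    # comb(n - c, k) counts the size-k subsets that contain no passing sample;
    # it is 0 when fewer than k samples failed, which yields pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 10 samples for one problem, 5 of them pass.
print(pass_at_k(n=10, c=5, k=1))
print(pass_at_k(n=10, c=5, k=10))
```

The benchmark score is then the mean of this per-problem estimate over all problems.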

Resources

Dataset: full, train, valid
Code: repository
Spaces: generation, highlighting

Author:

CodeParrot

Dataset size:

607.26 MB