Model: Daoguang/PyCodeGPT

PyCodeGPT

A pre-trained GPT model for Python code completion and generation

What is it?

PyCodeGPT is an efficient and effective GPT-Neo-based model for Python code generation tasks, in the same vein as OpenAI Codex, GitHub Copilot, CodeParrot, and AlphaCode.

Training Data

Because the publicly released datasets are small, we chose to collect data from GitHub from scratch. We first crawled 1.2 million Python-related repositories hosted on GitHub, then used their URLs to download the full contents of each repository. From these we kept 60M raw Python files, each under 1MB, totaling 330GB. Finally, we carefully designed a variety of data-cleaning strategies and obtained roughly 96GB of training data. See the table below for details.

Model       Repositories   Size and files after filtering
CodeParrot  0.56M          12GB (compressed), 5.4M files
Codex       54M            159GB
PyCodeGPT   1.2M           96GB, 13M files
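The file-level selection step described above (keep Python files under 1MB) can be sketched as follows. This is an illustrative snippet, not code from the PyCodeGPT pipeline; the function and variable names are hypothetical, and the real cleaning stage applies many more strategies than this size/extension check.

```python
# Illustrative sketch of the raw-file selection described above:
# keep Python source files strictly under 1 MB. The actual PyCodeGPT
# cleaning pipeline applies additional, more elaborate filters.
MAX_SIZE_BYTES = 1 * 1024 * 1024  # 1 MB cutoff mentioned in the text


def keep_file(path: str, size_bytes: int) -> bool:
    """Return True if the file looks like a Python source file under 1 MB."""
    return path.endswith(".py") and size_bytes < MAX_SIZE_BYTES


# Example: (path, size-in-bytes) pairs as they might come from a repo listing.
candidates = [
    ("repo/utils.py", 2_048),            # small Python file -> kept
    ("repo/generated.py", 5 * 1024 * 1024),  # over 1 MB -> dropped
    ("repo/README.md", 512),             # not Python -> dropped
]
kept = [path for path, size in candidates if keep_file(path, size)]
print(kept)  # ['repo/utils.py']
```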

Pre-trained Models

We aim to train medium-sized pre-trained models (110M parameters) based on GPT-Neo:

  • PyCodeGPT-110M: based on GPT-Neo 125M, with a vocabulary size of 32K.

GitHub

https://github.com/microsoft/PyCodeGPT
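A minimal sketch of running completion with the model via Hugging Face `transformers`, assuming the checkpoint is hosted on the Hub under the id shown on this card (`Daoguang/PyCodeGPT`); running it downloads the weights, so it needs network access, and the generation parameters here are illustrative defaults, not the authors' recommended settings.

```python
# Sketch: load the checkpoint and complete a Python function signature.
# Assumes the Hub id "Daoguang/PyCodeGPT" from this card; requires network
# access on first run to download the tokenizer and weights.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Daoguang/PyCodeGPT")
model = AutoModelForCausalLM.from_pretrained("Daoguang/PyCodeGPT")

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,      # length of the completion, illustrative choice
    do_sample=True,
    temperature=0.8,        # illustrative sampling temperature
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```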

Evaluation Results

Here are our evaluation results on the HumanEval dataset:

Note: at a comparable model scale, our model achieves accuracy comparable to Codex.

Model Pass@1 Pass@10 Pass@100
PyCodeGPT-110M 8.32% 13.53% 18.3%
GPT-Neo 125M 0.75% 1.88% 2.97%
GPT-Neo 1.3B 4.97% 7.47% 16.3%
GPT-Neo 2.7B 6.41% 11.27% 21.37%
GPT-J 6B 11.62% 15.74% 27.74%
TabNine 2.58% 4.35% 7.59%
CodeParrot 110M 3.80% 6.57% 12.78%
CodeParrot 1.5B 3.58% 8.03% 14.96%
Codex 12M 2.00% 3.62% 8.58%
Codex 25M 3.21% 7.1% 12.89%
Codex 42M 5.06% 8.8% 15.55%
Codex 85M 8.22% 12.81% 22.4%
Codex 300M 13.17% 20.37% 36.27%
Codex 679M 16.22% 25.7% 40.95%
Codex 2.5B 21.36% 35.42% 59.5%
Codex 12B 28.81% 46.81% 72.31%
Pretrained Decoder-only 13M (AlphaCode) 1.5% 3.6% 8.6%
Pretrained Decoder-only 29M (AlphaCode) 3.4% 5.8% 11.2%
Pretrained Decoder-only 55M (AlphaCode) 4.2% 8.2% 16.9%
Pretrained Decoder-only 89M (AlphaCode) 4.3% 12.2% 20.0%
Pretrained Decoder-only 302M (AlphaCode) 11.6% 18.8% 31.8%
Pretrained Decoder-only 685M (AlphaCode) 14.2% 24.4% 38.8%
Pretrained Decoder-only 1.1B (AlphaCode) 17.1% 28.2% 45.3%
PolyCoder 160M 2.13% 3.35% 4.88%
PolyCoder 400M 2.96% 5.29% 11.59%
PolyCoder 2.7B 5.59% 9.84% 17.68%
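The pass@k numbers above are conventionally computed with the unbiased estimator introduced with HumanEval: sample n completions per problem, count the c that pass the unit tests, and estimate pass@k as 1 − C(n−c, k)/C(n, k), averaged over problems. A small sketch of that estimator:

```python
# Unbiased pass@k estimator (as used for HumanEval-style evaluation):
# given n sampled completions of which c pass, estimate the probability
# that at least one of k randomly drawn samples passes.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 2 samples, 1 passing -> pass@1 is 0.5.
print(pass_at_k(2, 1, 1))  # 0.5
```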

References

If you want to use these models, please cite our paper:

@inproceedings{CERT,
  title={{CERT}: Continual Pre-training on Sketches for Library-oriented Code Generation},
  author={Zan, Daoguang and Chen, Bei and Yang, Dejian and Lin, Zeqi and Kim, Minsu and Guan, Bei and Wang, Yongji and Chen, Weizhu and Lou, Jian-Guang},
  booktitle={The 2022 International Joint Conference on Artificial Intelligence},
  year={2022}
}