Model:
Daoguang/PyCodeGPT
A pre-trained GPT model for Python code completion and generation
PyCodeGPT is an efficient and effective GPT-Neo-based model for Python code generation tasks, similar to OpenAI Codex, GitHub Copilot, CodeParrot, and AlphaCode.
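Assuming the checkpoint is published on the Hugging Face Hub under the model ID `Daoguang/PyCodeGPT` shown above and is compatible with the standard GPT-Neo causal-LM architecture in `transformers`, a minimal completion sketch could look like the following; the prompt and sampling parameters are illustrative, not the settings used in our evaluation.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumption: the released checkpoint is loadable under this model ID.
model_id = "Daoguang/PyCodeGPT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Complete a Python function from its signature and docstring.
prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt")

# Illustrative sampling settings, not the evaluation configuration.
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```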
Because publicly released datasets are relatively small, we propose collecting data from GitHub from scratch. We first crawled 1.2 million Python-related repositories hosted on GitHub, then used their URLs to download the full contents of each repository. From these we obtained 60M raw Python files under 1MB each, totaling 330GB. Finally, we carefully designed various data-cleaning strategies, yielding about 96GB of training data. Please refer to the table below for details; a minimal sketch of the size-based filtering step follows the table.
Model | Repositories | Size and files after filtering |
---|---|---|
CodeParrot | 0.56M | 12GB (compressed), 5.4M |
Codex | 54M | 159GB |
PyCodeGPT | 1.2M | 96GB, 13M |
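As a rough illustration of the size-based pre-filtering described above (keeping only Python files under 1MB), a minimal sketch might look like the code below; the actual cleaning pipeline applies additional strategies that are not reproduced here.

```python
from pathlib import Path

MAX_BYTES = 1 * 1024 * 1024  # 1MB cap mentioned in the text

def collect_python_files(repo_root: str) -> list[Path]:
    """Return all .py files under repo_root that are below the size cap."""
    kept = []
    for path in Path(repo_root).rglob("*.py"):
        if path.is_file() and path.stat().st_size < MAX_BYTES:
            kept.append(path)
    return kept
```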
Our goal is to train a medium-sized pre-trained model (110M parameters) based on GPT-Neo:
https://github.com/microsoft/PyCodeGPT
Here are our evaluation results on the HumanEval dataset:
Note: our model achieves accuracy comparable to Codex at a similar model size.
Model | Pass@1 | Pass@10 | Pass@100 |
---|---|---|---|
PyCodeGPT-110M | 8.32% | 13.53% | 18.3% |
GPT-Neo 125M | 0.75% | 1.88% | 2.97% |
GPT-Neo 1.3B | 4.97% | 7.47% | 16.3% |
GPT-Neo 2.7B | 6.41% | 11.27% | 21.37% |
GPT-J 6B | 11.62% | 15.74% | 27.74% |
TabNine | 2.58% | 4.35% | 7.59% |
CodeParrot 110M | 3.80% | 6.57% | 12.78% |
CodeParrot 1.5B | 3.58% | 8.03% | 14.96% |
Codex 12M | 2.00% | 3.62% | 8.58% |
Codex 25M | 3.21% | 7.1% | 12.89% |
Codex 42M | 5.06% | 8.8% | 15.55% |
Codex 85M | 8.22% | 12.81% | 22.4% |
Codex 300M | 13.17% | 20.37% | 36.27% |
Codex 679M | 16.22% | 25.7% | 40.95% |
Codex 2.5B | 21.36% | 35.42% | 59.5% |
Codex 12B | 28.81% | 46.81% | 72.31% |
Pretrained Decoder-only 13M (AlphaCode) | 1.5% | 3.6% | 8.6% |
Pretrained Decoder-only 29M (AlphaCode) | 3.4% | 5.8% | 11.2% |
Pretrained Decoder-only 55M (AlphaCode) | 4.2% | 8.2% | 16.9% |
Pretrained Decoder-only 89M (AlphaCode) | 4.3% | 12.2% | 20.0% |
Pretrained Decoder-only 302M (AlphaCode) | 11.6% | 18.8% | 31.8% |
Pretrained Decoder-only 685M (AlphaCode) | 14.2% | 24.4% | 38.8% |
Pretrained Decoder-only 1.1B (AlphaCode) | 17.1% | 28.2% | 45.3% |
PolyCoder 160M | 2.13% | 3.35% | 4.88% |
PolyCoder 400M | 2.96% | 5.29% | 11.59% |
PolyCoder 2.7B | 5.59% | 9.84% | 17.68% |
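The table reports pass@1/10/100 on HumanEval. A minimal sketch of the standard unbiased pass@k estimator introduced with Codex (not our evaluation harness) is shown below.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c of which pass the tests."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed in a numerically stable way.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```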
If you want to use these models, please cite our following paper:
@inproceedings{CERT, title={{CERT}: Continual Pre-training on Sketches for Library-oriented Code Generation}, author={Zan, Daoguang and Chen, Bei and Yang, Dejian and Lin, Zeqi and Kim, Minsu and Guan, Bei and Wang, Yongji and Chen, Weizhu and Lou, Jian-Guang}, booktitle={The 2022 International Joint Conference on Artificial Intelligence}, year={2022} }