Model:
Daoguang/PyCodeGPT
A pre-trained GPT model for Python code completion and generation
PyCodeGPT is an efficient and effective GPT-Neo-based model for Python code generation tasks, similar to OpenAI Codex, GitHub Copilot, CodeParrot, and AlphaCode.
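Assuming the checkpoint is published on the Hugging Face Hub under the model ID `Daoguang/PyCodeGPT` shown above and is compatible with the standard GPT-Neo causal-LM architecture in `transformers`, a minimal completion sketch could look like the following; the prompt and sampling parameters are illustrative, not the settings used in our evaluation.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumption: the released checkpoint is loadable under this model ID.
model_id = "Daoguang/PyCodeGPT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Complete a Python function from its signature and docstring.
prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt")

# Illustrative sampling settings, not the evaluation configuration.
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```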
Because publicly released datasets are relatively small, we propose collecting data from GitHub from scratch. We first crawled 1.2 million Python-related repositories hosted on GitHub, then used their URLs to download the full contents of each repository. From these we obtained 60M raw Python files under 1MB each, totaling 330GB. Finally, we carefully designed various data-cleaning strategies, yielding about 96GB of training data. Please refer to the table below for details; a minimal sketch of the size-based filtering step follows the table.
Model | Repositories | Size and files after filtering |
---|---|---|
CodeParrot | 0.56M | 12GB (compressed), 5.4M |
Codex | 54M | 159GB |
PyCodeGPT | 1.2M | 96GB, 13M |
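As a rough illustration of the size-based pre-filtering described above (keeping only Python files under 1MB), a minimal sketch might look like the code below; the actual cleaning pipeline applies additional strategies that are not reproduced here.

```python
from pathlib import Path

MAX_BYTES = 1 * 1024 * 1024  # 1MB cap mentioned in the text

def collect_python_files(repo_root: str) -> list[Path]:
    """Return all .py files under repo_root that are below the size cap."""
    kept = []
    for path in Path(repo_root).rglob("*.py"):
        if path.is_file() and path.stat().st_size < MAX_BYTES:
            kept.append(path)
    return kept
```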
Our goal is to train a medium-sized pre-trained model (110M parameters) based on GPT-Neo:
https://github.com/microsoft/PyCodeGPT
Here are our evaluation results on the HumanEval dataset:
Note: our model achieves accuracy comparable to Codex at a similar model size.
Model | Pass@1 | Pass@10 | Pass@100 |
---|---|---|---|
PyCodeGPT-110M | 8.32% | 13.53% | 18.3% |
GPT-Neo 125M | 0.75% | 1.88% | 2.97% |
GPT-Neo 1.3B | 4.97% | 7.47% | 16.3% |
GPT-Neo 2.7B | 6.41% | 11.27% | 21.37% |
GPT-J 6B | 11.62% | 15.74% | 27.74% |
TabNine | 2.58% | 4.35% | 7.59% |
CodeParrot 110M | 3.80% | 6.57% | 12.78% |
CodeParrot 1.5B | 3.58% | 8.03% | 14.96% |
Codex 12M | 2.00% | 3.62% | 8.58% |
Codex 25M | 3.21% | 7.1% | 12.89% |
Codex 42M | 5.06% | 8.8% | 15.55% |
Codex 85M | 8.22% | 12.81% | 22.4% |
Codex 300M | 13.17% | 20.37% | 36.27% |
Codex 679M | 16.22% | 25.7% | 40.95% |
Codex 2.5B | 21.36% | 35.42% | 59.5% |
Codex 12B | 28.81% | 46.81% | 72.31% |
Pretrained Decoder-only 13M (AlphaCode) | 1.5% | 3.6% | 8.6% |
Pretrained Decoder-only 29M (AlphaCode) | 3.4% | 5.8% | 11.2% |
Pretrained Decoder-only 55M (AlphaCode) | 4.2% | 8.2% | 16.9% |
Pretrained Decoder-only 89M (AlphaCode) | 4.3% | 12.2% | 20.0% |
Pretrained Decoder-only 302M (AlphaCode) | 11.6% | 18.8% | 31.8% |
Pretrained Decoder-only 685M (AlphaCode) | 14.2% | 24.4% | 38.8% |
Pretrained Decoder-only 1.1B (AlphaCode) | 17.1% | 28.2% | 45.3% |
PolyCoder 160M | 2.13% | 3.35% | 4.88% |
PolyCoder 400M | 2.96% | 5.29% | 11.59% |
PolyCoder 2.7B | 5.59% | 9.84% | 17.68% |
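The table reports pass@1/10/100 on HumanEval. A minimal sketch of the standard unbiased pass@k estimator introduced with Codex (not our evaluation harness) is shown below.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c of which pass the tests."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed in a numerically stable way.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```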
If you want to use these models, please cite our following paper:
@inproceedings{CERT, title={{CERT}: Continual Pre-training on Sketches for Library-oriented Code Generation}, author={Zan, Daoguang and Chen, Bei and Yang, Dejian and Lin, Zeqi and Kim, Minsu and Guan, Bei and Wang, Yongji and Chen, Weizhu and Lou, Jian-Guang}, booktitle={The 2022 International Joint Conference on Artificial Intelligence}, year={2022} }