
CodeT5 (large-sized model pre-trained on Python with the NTP objective)

Model description

CodeT5 is a family of encoder-decoder language models for code from the paper CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi.

The checkpoint included in this repository is denoted as CodeT5-large-ntp-py (770M), which was introduced in the paper CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C.H. Hoi.

Training data

CodeT5-large-ntp-py was pretrained on CodeSearchNet data in six programming languages (Ruby/JavaScript/Go/Python/Java/PHP) and on GCPY (the Python split of Github Code). See Section 4.1 of the paper for details.

Training procedure

CodeT5-large-ntp-py was first pretrained with the masked span prediction (MSP) objective for 150 epochs on CodeSearchNet, then for 10 epochs on GCPY, and finally with the next-token prediction (NTP) objective for another 10 epochs on GCPY. See Section 4.1 of the paper for details.
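
To make the NTP objective concrete, below is a minimal sketch of how a next-token-prediction training pair can be fed to the seq2seq interface: a code prefix goes into the encoder and the continuation serves as the decoder labels. The example function and the prefix/continuation split are illustrative assumptions, not the exact preprocessing used in the paper.

from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large-ntp-py")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large-ntp-py")

# Illustrative NTP-style pair: the prefix is the encoder input,
# the continuation is what the decoder must predict token by token.
prefix = "def add(a, b):"
continuation = "\n    return a + b"

input_ids = tokenizer(prefix, return_tensors="pt").input_ids
labels = tokenizer(continuation, return_tensors="pt").input_ids

# T5ForConditionalGeneration computes the cross-entropy loss over the labels.
loss = model(input_ids=input_ids, labels=labels).loss
print(loss.item())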

Evaluation results

We evaluated this checkpoint on the APPS benchmark. See Table 5 of the paper for details.
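
As a rough illustration of querying the checkpoint with APPS-style problems, the sketch below loads the community-hosted codeparrot/apps copy of the benchmark and generates one candidate program from a problem statement. The dataset identifier, field names, and generation settings are assumptions for illustration only; they do not reproduce the paper's evaluation protocol, which also executes the generated programs against unit tests.

from datasets import load_dataset
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large-ntp-py")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large-ntp-py")

# "codeparrot/apps" is a community mirror of the APPS benchmark (assumption);
# each example stores the natural-language problem statement in "question".
apps = load_dataset("codeparrot/apps", split="test")
prompt = apps[0]["question"]

input_ids = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).input_ids
generated_ids = model.generate(input_ids, max_length=512)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))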

How to use

This model can be easily loaded using the T5ForConditionalGeneration functionality:

from transformers import AutoTokenizer, T5ForConditionalGeneration
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large-ntp-py")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large-ntp-py")
text = "def hello_world():"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# simply generate a single sequence
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
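
For program synthesis it is common to sample several candidate completions instead of a single greedy sequence. Continuing from the snippet above, the following shows nucleus sampling; the hyperparameter values are illustrative assumptions, not the settings used in the paper.

# sample several candidate completions with nucleus sampling (illustrative settings)
generated_ids = model.generate(
    input_ids,
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
    max_length=128,
    num_return_sequences=5,
)
for candidate in generated_ids:
    print(tokenizer.decode(candidate, skip_special_tokens=True))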

BibTeX entry and citation information

@inproceedings{CodeT52021,
  author    = {Yue Wang and Weishi Wang and Shafiq R. Joty and Steven C. H. Hoi},
  title     = {CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
  booktitle = {EMNLP},
  pages     = {8696--8708},
  publisher = {Association for Computational Linguistics},
  year      = {2021}
}

@article{CodeRL2022,
  author    = {Hung Le and Yue Wang and Akhilesh Deepak Gotmare and Silvio Savarese and Steven C. H. Hoi},
  title     = {CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
  journal   = {arXiv preprint},
  volume    = {abs/2207.01780},
  year      = {2022}
}