Model:
Salesforce/codet5-large-ntp-py
CodeT5 is a family of encoder-decoder language models for code, from the paper CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi.
The checkpoint included in this repository is called CodeT5-large-ntp-py (770M), introduced in the paper CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C.H. Hoi.
CodeT5-large-ntp-py was pretrained on CodeSearchNet data in six programming languages (Ruby/JavaScript/Go/Python/Java/PHP) and on GCPY, the Python split of the GitHub Code dataset. See Section 4.1 of the paper for more details.
CodeT5-large-ntp-py was first pretrained with the masked span prediction (MSP) objective on CodeSearchNet for 150 epochs and on GCPY for 10 epochs, followed by another 10 epochs of pretraining with the next token prediction (NTP) objective. See Section 4.1 of the paper for more details.
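The sketch below illustrates the difference between the two objectives on a toy example. It is a conceptual illustration only, not the actual CodeT5 preprocessing code; the sentinel-token formatting follows the standard T5 convention and is an assumption here.

# Conceptual sketch of the two pretraining objectives (illustrative, not
# the exact CodeT5 data pipeline; sentinel formatting assumed T5-style).

source_code = "def add(a, b):\n    return a + b"

# Masked Span Prediction (MSP): random spans in the input are replaced by
# sentinel tokens, and the decoder reconstructs the masked spans.
msp_input  = "def add(a, b):\n    return <extra_id_0> b"
msp_target = "<extra_id_0> a + <extra_id_1>"

# Next Token Prediction (NTP): given a prefix of the program, the decoder
# generates the continuation left to right.
ntp_input  = "def add(a, b):"
ntp_target = "\n    return a + b"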
We evaluate this checkpoint on the APPS benchmark. See Table 5 of the paper for more details.
This model can be easily loaded using the T5ForConditionalGeneration functionality:
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large-ntp-py")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large-ntp-py")

text = "def hello_world():"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# simply generate a single sequence
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
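As a follow-up, a minimal sketch of sampling several candidate completions instead of a single greedy sequence, using the standard Hugging Face generate() arguments; the decoding hyperparameters below (top_p, temperature, number of samples) are illustrative choices, not values recommended by this model card.

# Sample multiple candidate completions for the same prompt (illustrative
# decoding settings, not official recommendations).
generated_ids = model.generate(
    input_ids,
    do_sample=True,          # nucleus sampling instead of greedy decoding
    top_p=0.95,
    temperature=0.8,
    max_length=128,
    num_return_sequences=5,  # number of sampled completions
)
for ids in generated_ids:
    print(tokenizer.decode(ids, skip_special_tokens=True))
    print("-" * 40)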
@inproceedings{CodeT52021,
  author    = {Yue Wang and Weishi Wang and Shafiq R. Joty and Steven C. H. Hoi},
  title     = {CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
  booktitle = {EMNLP},
  pages     = {8696--8708},
  publisher = {Association for Computational Linguistics},
  year      = {2021}
}

@article{CodeRL2022,
  author  = {Hung Le and Yue Wang and Akhilesh Deepak Gotmare and Silvio Savarese and Steven C. H. Hoi},
  title   = {CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
  journal = {arXiv preprint},
  volume  = {abs/2207.01780},
  year    = {2022}
}