Model:
Salesforce/codet5-large
CodeT5 is a family of encoder-decoder language models for code, introduced in the paper: CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation, by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi.
The checkpoint in this repository, referred to as CodeT5-large (770M), was introduced in the paper: CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning, by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C.H. Hoi.
CodeT5-large was pretrained on CodeSearchNet data in six programming languages (Ruby / JavaScript / Go / Python / Java / PHP). See Section 4.1 of the paper for details.
CodeT5-large was pretrained with a masked span prediction objective for 150 epochs. See Section 4.1 of the paper for details.
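For illustration, here is a minimal sketch of what a masked span prediction example might look like, following the T5 sentinel-token convention; the code snippet itself is a hypothetical example, not taken from the pretraining data:

# Hypothetical illustration of the masked span prediction format:
# contiguous spans in the encoder input are replaced by sentinel tokens,
# and the decoder target lists each sentinel followed by the masked span.
original_code = "def add(a, b): return a + b"
encoder_input = "def add(a, b): return <extra_id_0> + <extra_id_1>"
decoder_target = "<extra_id_0> a <extra_id_1> b <extra_id_2>"
print(encoder_input)
print(decoder_target)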
We validate the effectiveness of this checkpoint, pretrained with simplified strategies, on the CodeXGLUE benchmark. See Appendix A.1 of the paper for details.
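As a rough illustration of how fine-tuning on a downstream sequence-to-sequence task (e.g., code summarization) works with this checkpoint, the following is a minimal sketch of a single training step; the source/target pair and the learning rate are hypothetical placeholders, not values from the paper:

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large")

source = "def add(a, b): return a + b"   # hypothetical source code
target = "add two numbers"               # hypothetical target summary

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# The model computes the cross-entropy loss over the target tokens when labels are given.
outputs = model(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, labels=labels)
loss = outputs.loss

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # illustrative learning rate
loss.backward()
optimizer.step()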
This model can be easily loaded using the T5ForConditionalGeneration functionality:
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large")

text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# simply generate a single sequence
generated_ids = model.generate(input_ids, max_length=8)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
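Beyond decoding a single sequence, the standard generate() arguments can be used to obtain several candidate completions, for example with beam search; the beam settings below are illustrative only:

# Generate multiple candidates for the masked span via beam search
generated_ids = model.generate(input_ids, max_length=8, num_beams=5, num_return_sequences=3)
for ids in generated_ids:
    print(tokenizer.decode(ids, skip_special_tokens=True))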
@inproceedings{CodeT52021,
  author    = {Yue Wang and Weishi Wang and Shafiq R. Joty and Steven C. H. Hoi},
  title     = {CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
  booktitle = {EMNLP},
  pages     = {8696--8708},
  publisher = {Association for Computational Linguistics},
  year      = {2021}
}

@article{CodeRL2022,
  author  = {Hung Le and Yue Wang and Akhilesh Deepak Gotmare and Silvio Savarese and Steven C. H. Hoi},
  title   = {CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
  journal = {arXiv preprint},
  volume  = {abs/2207.01780},
  year    = {2022}
}