Model:
Salesforce/codet5-large-ntp-py
CodeT5 is a family of encoder-decoder language models for code, from the paper CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi.
The checkpoint included in this repository is called CodeT5-large-ntp-py (770M), introduced in the paper CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C.H. Hoi.
CodeT5-large-ntp-py was pretrained on CodeSearchNet data in six programming languages (Ruby/JavaScript/Go/Python/Java/PHP) and on GCPY, the Python split of the GitHub Code dataset. See Section 4.1 of the paper for more details.
CodeT5-large-ntp-py was first pretrained with the masked span prediction (MSP) objective on CodeSearchNet for 150 epochs and on GCPY for 10 epochs, followed by another 10 epochs of pretraining with the next token prediction (NTP) objective. See Section 4.1 of the paper for more details.
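The sketch below illustrates the difference between the two objectives on a toy example. It is a conceptual illustration only, not the actual CodeT5 preprocessing code; the sentinel-token formatting follows the standard T5 convention and is an assumption here.

# Conceptual sketch of the two pretraining objectives (illustrative, not
# the exact CodeT5 data pipeline; sentinel formatting assumed T5-style).

source_code = "def add(a, b):\n    return a + b"

# Masked Span Prediction (MSP): random spans in the input are replaced by
# sentinel tokens, and the decoder reconstructs the masked spans.
msp_input  = "def add(a, b):\n    return <extra_id_0> b"
msp_target = "<extra_id_0> a + <extra_id_1>"

# Next Token Prediction (NTP): given a prefix of the program, the decoder
# generates the continuation left to right.
ntp_input  = "def add(a, b):"
ntp_target = "\n    return a + b"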
We evaluate this checkpoint on the APPS benchmark. See Table 5 of the paper for more details.
This model can be easily loaded using the T5ForConditionalGeneration functionality:
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large-ntp-py")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large-ntp-py")

text = "def hello_world():"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# simply generate a single sequence
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
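As a follow-up, a minimal sketch of sampling several candidate completions instead of a single greedy sequence, using the standard Hugging Face generate() arguments; the decoding hyperparameters below (top_p, temperature, number of samples) are illustrative choices, not values recommended by this model card.

# Sample multiple candidate completions for the same prompt (illustrative
# decoding settings, not official recommendations).
generated_ids = model.generate(
    input_ids,
    do_sample=True,          # nucleus sampling instead of greedy decoding
    top_p=0.95,
    temperature=0.8,
    max_length=128,
    num_return_sequences=5,  # number of sampled completions
)
for ids in generated_ids:
    print(tokenizer.decode(ids, skip_special_tokens=True))
    print("-" * 40)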
@inproceedings{CodeT52021,
  author    = {Yue Wang and Weishi Wang and Shafiq R. Joty and Steven C. H. Hoi},
  title     = {CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
  booktitle = {EMNLP},
  pages     = {8696--8708},
  publisher = {Association for Computational Linguistics},
  year      = {2021}
}

@article{CodeRL2022,
  author  = {Hung Le and Yue Wang and Akhilesh Deepak Gotmare and Silvio Savarese and Steven C. H. Hoi},
  title   = {CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
  journal = {arXiv preprint},
  volume  = {abs/2207.01780},
  year    = {2022}
}