Model:
Salesforce/codet5-large
CodeT5 is a family of encoder-decoder language models for code, introduced in the paper: CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation, by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi.
The checkpoint in this repository, referred to as CodeT5-large (770M), was introduced in the paper: CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning, by Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C.H. Hoi.
CodeT5-large was pretrained on CodeSearchNet data in six programming languages (Ruby / JavaScript / Go / Python / Java / PHP). See Section 4.1 of the paper for details.
CodeT5-large was pretrained with a masked span prediction objective for 150 epochs. See Section 4.1 of the paper for details.
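For illustration, here is a minimal sketch of what a masked span prediction example might look like, following the T5 sentinel-token convention; the code snippet itself is a hypothetical example, not taken from the pretraining data:

# Hypothetical illustration of the masked span prediction format:
# contiguous spans in the encoder input are replaced by sentinel tokens,
# and the decoder target lists each sentinel followed by the masked span.
original_code = "def add(a, b): return a + b"
encoder_input = "def add(a, b): return <extra_id_0> + <extra_id_1>"
decoder_target = "<extra_id_0> a <extra_id_1> b <extra_id_2>"
print(encoder_input)
print(decoder_target)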
We validate the effectiveness of this checkpoint, pretrained with simplified strategies, on the CodeXGLUE benchmark. See Appendix A.1 of the paper for details.
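As a rough illustration of how fine-tuning on a downstream sequence-to-sequence task (e.g., code summarization) works with this checkpoint, the following is a minimal sketch of a single training step; the source/target pair and the learning rate are hypothetical placeholders, not values from the paper:

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large")

source = "def add(a, b): return a + b"   # hypothetical source code
target = "add two numbers"               # hypothetical target summary

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# The model computes the cross-entropy loss over the target tokens when labels are given.
outputs = model(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, labels=labels)
loss = outputs.loss

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # illustrative learning rate
loss.backward()
optimizer.step()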
This model can be easily loaded using the T5ForConditionalGeneration functionality:
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-large")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large")

text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# simply generate a single sequence
generated_ids = model.generate(input_ids, max_length=8)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
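Beyond decoding a single sequence, the standard generate() arguments can be used to obtain several candidate completions, for example with beam search; the beam settings below are illustrative only:

# Generate multiple candidates for the masked span via beam search
generated_ids = model.generate(input_ids, max_length=8, num_beams=5, num_return_sequences=3)
for ids in generated_ids:
    print(tokenizer.decode(ids, skip_special_tokens=True))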
@inproceedings{CodeT52021,
  author    = {Yue Wang and Weishi Wang and Shafiq R. Joty and Steven C. H. Hoi},
  title     = {CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
  booktitle = {EMNLP},
  pages     = {8696--8708},
  publisher = {Association for Computational Linguistics},
  year      = {2021}
}

@article{CodeRL2022,
  author  = {Hung Le and Yue Wang and Akhilesh Deepak Gotmare and Silvio Savarese and Steven C. H. Hoi},
  title   = {CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning},
  journal = {arXiv preprint},
  volume  = {abs/2207.01780},
  year    = {2022}
}