Pre-trained weights for CodeBERT: A Pre-Trained Model for Programming and Natural Languages (arXiv:2002.08155).
The model is trained on the code corpus of CodeSearchNet.
The model is initialized from RoBERTa-base and trained with a simple MLM (masked language modeling) objective.
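To illustrate what an MLM objective asks of a model, the following is a minimal sketch of the input-masking step in plain Python. It is not the authors' training code: the helper `mask_tokens`, the whitespace tokenization, and the default masking probability of 0.15 are assumptions made for illustration, and the card does not state the actual masking rate or strategy.

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa-style mask token, as in the usage example in this card


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for a simplified MLM setup.

    labels[i] holds the original token where a mask was applied, else None.
    mask_prob=0.15 is an assumed default, not a value taken from this card.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # hide the token from the model
            labels.append(tok)         # the model is trained to recover it
        else:
            masked.append(tok)
            labels.append(None)        # no loss is computed at this position
    return masked, labels


# Whitespace splitting stands in for a real subword tokenizer here.
tokens = "if ( x is not None ) and ( x > 1 )".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=0)
print(masked)
print(labels)
```

During training, the model sees the masked sequence and is scored only on how well it predicts the original tokens at the masked positions.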
    from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline

    # Load the MLM checkpoint and its tokenizer
    model = RobertaForMaskedLM.from_pretrained('microsoft/codebert-base-mlm')
    tokenizer = RobertaTokenizer.from_pretrained('microsoft/codebert-base-mlm')

    # The <mask> token marks the position the model should fill in
    code_example = "if (x is not None) <mask> (x>1)"

    fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
    outputs = fill_mask(code_example)
    print(outputs)
    {'sequence': '<s> if (x is not None) and (x>1)</s>', 'score': 0.6049249172210693, 'token': 8}
    {'sequence': '<s> if (x is not None) or (x>1)</s>', 'score': 0.30680200457572937, 'token': 50}
    {'sequence': '<s> if (x is not None) if (x>1)</s>', 'score': 0.02133703976869583, 'token': 114}
    {'sequence': '<s> if (x is not None) then (x>1)</s>', 'score': 0.018607674166560173, 'token': 172}
    {'sequence': '<s> if (x is not None) AND (x>1)</s>', 'score': 0.007619690150022507, 'token': 4248}
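As a possible follow-up to the output above, the sketch below picks the highest-scoring candidate and strips the `<s>` and `</s>` markers. The helpers `top_prediction` and `strip_special` are hypothetical names introduced here, and the code assumes the result format shown above (dicts with `sequence`, `score`, and `token` keys); other versions of transformers may return a different shape. The sample data copies the first two entries of that output so the snippet runs without downloading the model.

```python
def top_prediction(outputs):
    """Return the candidate dict with the highest 'score'."""
    return max(outputs, key=lambda o: o["score"])


def strip_special(sequence):
    """Remove the <s> and </s> markers that wrap the filled-in sequence."""
    return sequence.replace("<s>", "").replace("</s>", "").strip()


# First two entries copied from the output shown above.
sample_outputs = [
    {"sequence": "<s> if (x is not None) and (x>1)</s>",
     "score": 0.6049249172210693, "token": 8},
    {"sequence": "<s> if (x is not None) or (x>1)</s>",
     "score": 0.30680200457572937, "token": 50},
]

best = top_prediction(sample_outputs)
print(strip_special(best["sequence"]))  # if (x is not None) and (x>1)
```

Using `max` with a key avoids relying on the list already being sorted by score.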
    @misc{feng2020codebert,
        title={CodeBERT: A Pre-Trained Model for Programming and Natural Languages},
        author={Zhangyin Feng and Daya Guo and Duyu Tang and Nan Duan and Xiaocheng Feng and Ming Gong and Linjun Shou and Bing Qin and Ting Liu and Daxin Jiang and Ming Zhou},
        year={2020},
        eprint={2002.08155},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
    }