Model:
microsoft/codebert-base-mlm
Task:
Fill-Mask
Preprint:
arxiv:2002.08155

Pretrained weights for CodeBERT: A Pre-Trained Model for Programming and Natural Languages.
The model is trained on the code corpus of CodeSearchNet.
It is initialized with RoBERTa-base and trained with a simple MLM (Masked Language Model) objective.
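For intuition, here is a minimal sketch of what one MLM step looks like with the standard Hugging Face `transformers` API. This is not the original CodeBERT training script: a real setup (e.g. `DataCollatorForLanguageModeling`) would also exclude special tokens from masking and apply the 80/10/10 mask/random/keep rule.

```python
import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained('microsoft/codebert-base-mlm')
model = RobertaForMaskedLM.from_pretrained('microsoft/codebert-base-mlm')

code = "def add(a, b): return a + b"
inputs = tokenizer(code, return_tensors='pt')

# Keep a copy of the true ids as labels, then mask ~15% of positions
# (the proportion used in BERT-style MLM pre-training).
labels = inputs.input_ids.clone()
masked = torch.bernoulli(torch.full(labels.shape, 0.15)).bool()
labels[~masked] = -100  # loss is computed only over masked positions
inputs.input_ids[masked] = tokenizer.mask_token_id

# RobertaForMaskedLM returns the cross-entropy loss over masked tokens.
loss = model(**inputs, labels=labels).loss
print(loss.item())
```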
Usage:

```python
from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline

model = RobertaForMaskedLM.from_pretrained('microsoft/codebert-base-mlm')
tokenizer = RobertaTokenizer.from_pretrained('microsoft/codebert-base-mlm')

code_example = "if (x is not None) <mask> (x>1)"
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

outputs = fill_mask(code_example)
print(outputs)
```
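If you prefer not to use the `pipeline` helper, the same top-5 predictions can be read off the raw logits. The following is an equivalent sketch, assuming the standard `transformers`/PyTorch API:

```python
import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained('microsoft/codebert-base-mlm')
model = RobertaForMaskedLM.from_pretrained('microsoft/codebert-base-mlm')

inputs = tokenizer("if (x is not None) <mask> (x>1)", return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits

# Find the <mask> position and softmax its logits over the vocabulary.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
probs = logits[0, mask_pos].softmax(dim=-1)

# Print the five highest-scoring completions, mirroring the pipeline output.
top = probs.topk(5)
for score, token_id in zip(top.values[0].tolist(), top.indices[0].tolist()):
    print(repr(tokenizer.decode([token_id])), round(score, 4))
```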
Expected results:

```
{'sequence': '<s> if (x is not None) and (x>1)</s>', 'score': 0.6049249172210693, 'token': 8}
{'sequence': '<s> if (x is not None) or (x>1)</s>', 'score': 0.30680200457572937, 'token': 50}
{'sequence': '<s> if (x is not None) if (x>1)</s>', 'score': 0.02133703976869583, 'token': 114}
{'sequence': '<s> if (x is not None) then (x>1)</s>', 'score': 0.018607674166560173, 'token': 172}
{'sequence': '<s> if (x is not None) AND (x>1)</s>', 'score': 0.007619690150022507, 'token': 4248}
```
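The `token` field is the id of the predicted token in the vocabulary, which is shared with RoBERTa-base since the tokenizer is reused. The ids from the output above can be decoded back to tokens:

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('microsoft/codebert-base-mlm')
# Ids taken from the expected results above; 'Ġ' marks a leading space
# in the byte-level BPE vocabulary.
print(tokenizer.convert_ids_to_tokens([8, 50, 114]))  # ['Ġand', 'Ġor', 'Ġif']
```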
Reference:

```bibtex
@misc{feng2020codebert,
    title={CodeBERT: A Pre-Trained Model for Programming and Natural Languages},
    author={Zhangyin Feng and Daya Guo and Duyu Tang and Nan Duan and Xiaocheng Feng and Ming Gong and Linjun Shou and Bing Qin and Ting Liu and Daxin Jiang and Ming Zhou},
    year={2020},
    eprint={2002.08155},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```