Model:

huggingface/CodeBERTa-language-id


CodeBERTa-language-id: the world's fanciest programming language identification algo

To demonstrate the usefulness of our CodeBERTa pretrained model on downstream tasks beyond language modeling, we fine-tune the CodeBERTa-small-v1 checkpoint on the task of classifying code samples by the programming language they are written in (programming language identification).

We add a sequence classification head on top of the model.

On the evaluation dataset, we obtain accuracy and F1 > 0.999, which is not surprising given that the task of language identification is relatively easy (see the intuition further below).
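Concretely, attaching the head amounts to loading the pretrained checkpoint with a 6-way classification layer; a minimal sketch (the full fine-tuning script is reproduced further below), where num_labels=6 is the number of languages in CodeSearchNet:

from transformers import RobertaForSequenceClassification

model = RobertaForSequenceClassification.from_pretrained(
    "huggingface/CodeBERTa-small-v1",
    num_labels=6,  # go, java, javascript, php, python, ruby
)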

Quick start: using the raw model

from transformers import RobertaTokenizer, RobertaForSequenceClassification

CODEBERTA_LANGUAGE_ID = "huggingface/CodeBERTa-language-id"

tokenizer = RobertaTokenizer.from_pretrained(CODEBERTA_LANGUAGE_ID)
model = RobertaForSequenceClassification.from_pretrained(CODEBERTA_LANGUAGE_ID)

input_ids = tokenizer.encode(CODE_TO_IDENTIFY, return_tensors="pt")
logits = model(input_ids)[0]

language_idx = logits.argmax() # index for the resulting label
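To map the index back to a language name, the id2label mapping stored in the model config can presumably be used (the pipeline outputs below suggest it is populated with the language names):

predicted_language = model.config.id2label[language_idx.item()]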

Quick start: using Pipelines

from transformers import RobertaForSequenceClassification, RobertaTokenizer, TextClassificationPipeline

pipeline = TextClassificationPipeline(
    model=RobertaForSequenceClassification.from_pretrained(CODEBERTA_LANGUAGE_ID),
    tokenizer=RobertaTokenizer.from_pretrained(CODEBERTA_LANGUAGE_ID)
)

pipeline(CODE_TO_IDENTIFY)

Let's start with something very easy:

pipeline("""
def f(x):
    return x**2
""")
# [{'label': 'python', 'score': 0.9999965}]

Now let's probe a shorter code sample:

pipeline("const foo = 'bar'")
# [{'label': 'javascript', 'score': 0.9977546}]

What if I remove the const token from the assignment?

pipeline("foo = 'bar'")
# [{'label': 'javascript', 'score': 0.7176245}]

For some reason, this is still statistically detected as JS code, even though it is also perfectly valid Python code. However, if we tweak it slightly:

pipeline("foo = u'bar'")
# [{'label': 'python', 'score': 0.7638422}]

This is now detected as Python (note the u string modifier).

Okay, enough of JS and Python dominating the scene! Let's try a fancier language:

pipeline("echo $FOO")
# [{'label': 'php', 'score': 0.9995257}]

(Yes, I used the word "fancy" to describe PHP.)

pipeline("outcome := rand.Intn(6) + 1")
# [{'label': 'go', 'score': 0.9936151}]

Why is the problem of language identification so easy (with the right toolkit)? Because the syntax of code is rigid, and simple tokens such as := (the assignment operator in Go) are perfect predictors of the underlying language:

pipeline(":=")
# [{'label': 'go', 'score': 0.9998052}]

By the way, because we trained our own custom tokenizer on the CodeSearchNet dataset, and because it handles streams of bytes in a very generic way, syntactic constructs such as := are represented by a single token:

tokenizer.encode(" :=", add_special_tokens=False)
# [521]
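The same check can be reproduced with the raw tokenizers-library tokenizer used in the fine-tuning script below, assuming the vocab.json and merges.txt files of the pretrained checkpoint are available locally:

from tokenizers.implementations.byte_level_bpe import ByteLevelBPETokenizer

bpe_tokenizer = ByteLevelBPETokenizer("./pretrained/vocab.json", "./pretrained/merges.txt")
print(bpe_tokenizer.encode(" :=").ids)
# a single id is expected ([521] with the published vocab)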

Fine-tuning code

import gzip
import json
import logging
import os
from pathlib import Path
from typing import Dict, List, Tuple

import numpy as np
import torch
from sklearn.metrics import f1_score
from tokenizers.implementations.byte_level_bpe import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset
from torch.utils.tensorboard.writer import SummaryWriter
from tqdm import tqdm, trange

from transformers import RobertaForSequenceClassification
from transformers.data.metrics import acc_and_f1, simple_accuracy


logging.basicConfig(level=logging.INFO)


CODEBERTA_PRETRAINED = "huggingface/CodeBERTa-small-v1"

LANGUAGES = [
    "go",
    "java",
    "javascript",
    "php",
    "python",
    "ruby",
]
FILES_PER_LANGUAGE = 1
EVALUATE = True

# Set up tokenizer
tokenizer = ByteLevelBPETokenizer("./pretrained/vocab.json", "./pretrained/merges.txt",)
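# Mirror RoBERTa's input format: wrap each encoded sequence in <s> ... </s> and cap its length at 512 tokens.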
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")), ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

# Set up Tensorboard
tb_writer = SummaryWriter()


class CodeSearchNetDataset(Dataset):
    examples: List[Tuple[List[int], int]]

    def __init__(self, split: str = "train"):
        """
        train | valid | test
        """

        self.examples = []

        src_files = []
        for language in LANGUAGES:
            src_files += list(
                Path("../CodeSearchNet/resources/data/").glob(f"{language}/final/jsonl/{split}/*.jsonl.gz")
            )[:FILES_PER_LANGUAGE]
        for src_file in src_files:
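            # The language label is the directory four levels above the file,
            # e.g. .../data/python/final/jsonl/train/foo.jsonl.gz -> "python".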
            label = src_file.parents[3].name
            label_idx = LANGUAGES.index(label)
            print("?", src_file, label)
            lines = []
            fh = gzip.open(src_file, mode="rt", encoding="utf-8")
            for line in fh:
                o = json.loads(line)
                lines.append(o["code"])
            examples = [(x.ids, label_idx) for x in tokenizer.encode_batch(lines)]
            self.examples += examples
        print("??")

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        # We’ll pad at the batch level.
        return self.examples[i]


model = RobertaForSequenceClassification.from_pretrained(CODEBERTA_PRETRAINED, num_labels=len(LANGUAGES))

train_dataset = CodeSearchNetDataset(split="train")
eval_dataset = CodeSearchNetDataset(split="test")


def collate(examples):
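    # Pad dynamically to the longest sequence in the batch; padding_value=1 is assumed to be
    # the <pad> id, following the standard RoBERTa special-token layout (<s>=0, <pad>=1).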
    input_ids = pad_sequence([torch.tensor(x[0]) for x in examples], batch_first=True, padding_value=1)
    labels = torch.tensor([x[1] for x in examples])
    # Note: labels stay 1-D here; no .unsqueeze(-1) is needed.
    return input_ids, labels


train_dataloader = DataLoader(train_dataset, batch_size=256, shuffle=True, collate_fn=collate)

# Sanity check: pull one batch to confirm the collate function produces the expected shapes.
batch = next(iter(train_dataloader))


model.to("cuda")
model.train()
for param in model.roberta.parameters():
    param.requires_grad = False
## ^^ Only train final layer.

print(f"num params:", model.num_parameters())
print(f"num trainable params:", model.num_parameters(only_trainable=True))


def evaluate():
    eval_loss = 0.0
    nb_eval_steps = 0
    preds = np.empty((0), dtype=np.int64)
    out_label_ids = np.empty((0), dtype=np.int64)

    model.eval()

    eval_dataloader = DataLoader(eval_dataset, batch_size=512, collate_fn=collate)
    for step, (input_ids, labels) in enumerate(tqdm(eval_dataloader, desc="Eval")):
        with torch.no_grad():
            outputs = model(input_ids=input_ids.to("cuda"), labels=labels.to("cuda"))
            loss = outputs[0]
            logits = outputs[1]
            eval_loss += loss.mean().item()
            nb_eval_steps += 1
        preds = np.append(preds, logits.argmax(dim=1).detach().cpu().numpy(), axis=0)
        out_label_ids = np.append(out_label_ids, labels.detach().cpu().numpy(), axis=0)
    eval_loss = eval_loss / nb_eval_steps
    acc = simple_accuracy(preds, out_label_ids)
    f1 = f1_score(y_true=out_label_ids, y_pred=preds, average="macro")
    print("=== Eval: loss ===", eval_loss)
    print("=== Eval: acc. ===", acc)
    print("=== Eval: f1 ===", f1)
    # print(acc_and_f1(preds, out_label_ids))
    tb_writer.add_scalars("eval", {"loss": eval_loss, "acc": acc, "f1": f1}, global_step)


### Training loop

global_step = 0
train_iterator = trange(0, 4, desc="Epoch")
# AdamW with default hyper-parameters; only the classification head has requires_grad=True.
optimizer = torch.optim.AdamW(model.parameters())
for _ in train_iterator:
    epoch_iterator = tqdm(train_dataloader, desc="Iteration")
    for step, (input_ids, labels) in enumerate(epoch_iterator):
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids.to("cuda"), labels=labels.to("cuda"))
        loss = outputs[0]
        loss.backward()
        tb_writer.add_scalar("training_loss", loss.item(), global_step)
        optimizer.step()
        global_step += 1
        if EVALUATE and global_step % 50 == 0:
            evaluate()
            model.train()


evaluate()

os.makedirs("./models/CodeBERT-language-id", exist_ok=True)
model.save_pretrained("./models/CodeBERT-language-id")
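
Once saved, the fine-tuned classifier can be reloaded for inference just like the published checkpoint. A minimal sketch, assuming the CodeBERTa tokenizer is loaded from the pretrained repo (the script above does not save a tokenizer next to the model) and that the label order follows the LANGUAGES list defined earlier:

from transformers import RobertaForSequenceClassification, RobertaTokenizer

clf = RobertaForSequenceClassification.from_pretrained("./models/CodeBERT-language-id")
tok = RobertaTokenizer.from_pretrained(CODEBERTA_PRETRAINED)

input_ids = tok.encode("outcome := rand.Intn(6) + 1", return_tensors="pt")
print(LANGUAGES[clf(input_ids)[0].argmax().item()])  # index into the LANGUAGES list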

CodeSearchNet citation

@article{husain_codesearchnet_2019,
    title = {{CodeSearchNet} {Challenge}: {Evaluating} the {State} of {Semantic} {Code} {Search}},
    shorttitle = {{CodeSearchNet} {Challenge}},
    url = {http://arxiv.org/abs/1909.09436},
    urldate = {2020-03-12},
    journal = {arXiv:1909.09436 [cs, stat]},
    author = {Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
    month = sep,
    year = {2019},
    note = {arXiv: 1909.09436},
}