Accelerate Inference of Hugging Face Transformer Models with Optimum Intel and OpenVINO™

Published by alex on October 27, 2023

Introduction

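Hugging Face is a large open-source community that has quickly become a compelling hub for pre-trained deep learning models in natural language processing (NLP), automatic speech recognition (ASR), and computer vision (CV).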

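Optimum Intel provides a simple interface for optimizing Transformer models and converting them to the OpenVINO™ Intermediate Representation (IR) format, accelerating end-to-end pipelines on Intel® architectures with the OpenVINO™ Runtime.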

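Sentiment classification, a popular NLP task, is the process of automatically identifying opinions in text and labeling them as positive or negative. In this article, we use DistilBERT on a sentiment classification task to show how Optimum Intel optimizes a model with the Neural Network Compression Framework (NNCF) and accelerates inference with the OpenVINO™ Runtime.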

Setting up the environment

Follow these steps to install optimum-intel and its dependencies in a new Python virtual environment:

conda create -n optimum-intel python=3.8
conda activate optimum-intel
python -m pip install torch==1.9.1 onnx py-cpuinfo
python -m pip install "optimum[openvino,nncf]"
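Before moving on, it can help to confirm that the packages installed above are visible from the active environment. The following is a minimal sketch that uses only the Python standard library; the list of top-level module names is an assumption derived from the install commands, not something the packages document here.

```python
import importlib.util

# Top-level module names assumed to be provided by the install commands above.
REQUIRED_MODULES = ["torch", "onnx", "cpuinfo", "optimum", "openvino", "nncf"]


def check_modules(names):
    """Return a dict mapping each module name to True if it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules(REQUIRED_MODULES).items():
        print(f"{name:10s} {'OK' if found else 'MISSING'}")
```

Any module reported as MISSING suggests the corresponding pip command did not run in the activated conda environment.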

Model inference with the OpenVINO™ Runtime

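Optimum inference models are API-compatible with Hugging Face Transformers models, so you can switch to OpenVINO™ Runtime inference simply by replacing a Hugging Face Transformers "AutoModelXXX" class with the corresponding "OVModelXXX" class. When loading a model with the from_pretrained() method, you can set "from_transformers=True", and the loaded model is automatically converted to OpenVINO™ IR for inference with the OpenVINO™ Runtime.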

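The following example runs inference with the OpenVINO™ Runtime on a sentiment classification task. The pipeline output contains the classification label (positive/negative) and the corresponding confidence score.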

from optimum.intel.openvino import OVModelForSequenceClassification
from transformers import AutoTokenizer, pipeline
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
hf_model = OVModelForSequenceClassification.from_pretrained(
    model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
hf_pipe_cls = pipeline("text-classification",
                       model=hf_model, tokenizer=tokenizer)
text = "He's a dreadful magician."
fp32_outputs = hf_pipe_cls(text)
print("FP32 model outputs: ", fp32_outputs)

Model quantization with the NNCF framework

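Most deep learning models are built with 32-bit floating-point precision (FP32). Quantization is the process of representing a model with lower-precision numbers so that it uses less memory, while keeping the loss of accuracy as small as possible. To further optimize model performance on Intel® architectures, the model is quantized to 8-bit integer precision (INT8).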
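To make the FP32-to-INT8 mapping concrete, here is a minimal sketch of affine quantization in plain Python. It is an illustration of the general idea only and is not NNCF's implementation; the function names and the sample weights are made up for this example.

```python
def quantize_int8(values):
    """Affine (asymmetric) quantization of a list of floats to signed 8-bit integers."""
    qmin, qmax = -128, 127
    # The representable range must contain 0.0 so that zero maps to an exact integer.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map INT8 codes back to approximate floating-point values."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.0, -0.25, 0.0, 0.5, 2.0]  # made-up FP32 values
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
print("INT8 codes:", q)
print("max round-trip error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Each value is stored in one byte instead of four, and the round-trip error of any value inside the calibrated range is at most half of `scale`.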

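Optimum Intel applies quantization to Hugging Face Transformer models through NNCF. NNCF provides two mainstream quantization methods: post-training quantization (PTQ) and quantization-aware training (QAT).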

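1. Post-training quantization (PTQ) quantizes the model without fine-tuning, using a representative calibration dataset.

2. Quantization-aware training (QAT) simulates the effects of quantization during training to reduce the impact on model accuracy.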

Model quantization with NNCF PTQ

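NNCF post-training static quantization introduces an additional calibration step, in which data is fed through the network to compute the activation quantization parameters. The following steps apply static quantization to a pre-trained DistilBERT, using the General Language Understanding Evaluation (GLUE) dataset as the calibration dataset.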

from functools import partial
from optimum.intel.openvino import OVQuantizer, OVConfig
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
def preprocess_fn(examples, tokenizer):
    return tokenizer(
        examples["sentence"], padding=True, truncation=True, max_length=128
    )
quantizer = OVQuantizer.from_pretrained(model)
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=100,
    dataset_split="train",
    preprocess_batch=True,
)
# Load the default quantization configuration
ov_config = OVConfig()
# The directory where the quantized model will be saved
save_dir = "nncf_ptq_results"
# Apply static quantization and save the resulting model in the OpenVINO IR format
quantizer.quantize(calibration_dataset=calibration_dataset,
                   save_directory=save_dir, quantization_config=ov_config)

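The quantize() method applies post-training static quantization and exports the resulting quantized model to the OpenVINO™ Intermediate Representation (IR), which can be deployed on any target Intel® architecture.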
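The calibration step can be pictured as follows: sample batches are run through the network, the observed activation range is recorded, and fixed quantization parameters are derived from that range. The sketch below uses a simple min/max observer and a toy layer, both of which are assumptions made for illustration; NNCF's actual range estimation may differ.

```python
class MinMaxObserver:
    """Tracks the running min/max of activations seen during calibration."""

    def __init__(self):
        self.lo = 0.0
        self.hi = 0.0

    def observe(self, activations):
        self.lo = min(self.lo, min(activations))
        self.hi = max(self.hi, max(activations))

    def quant_params(self, qmin=0, qmax=255):
        """Derive scale and zero point for unsigned 8-bit activations."""
        scale = (self.hi - self.lo) / (qmax - qmin) or 1.0
        zero_point = round(qmin - self.lo / scale)
        return scale, zero_point


def toy_layer(inputs):
    """Hypothetical stand-in for a network layer: a shifted, scaled ReLU."""
    return [max(0.0, 2.0 * x - 1.0) for x in inputs]


def quantize_u8(values, scale, zero_point):
    """Quantize with fixed parameters; values outside the calibrated range are clipped."""
    return [max(0, min(255, round(v / scale) + zero_point)) for v in values]


# Made-up calibration batches standing in for samples from a real dataset.
calibration_batches = [[0.1, 0.9, 1.5], [0.0, 2.0, 0.7], [1.2, 0.3, 1.8]]
observer = MinMaxObserver()
for batch in calibration_batches:
    observer.observe(toy_layer(batch))

scale, zero_point = observer.quant_params()
# 5.0 produces an activation far outside the calibrated range, so it clips to 255.
codes = quantize_u8(toy_layer([1.0, 5.0]), scale, zero_point)
print(scale, zero_point, codes)
```

The clipped second value shows why the calibration data should be representative: inputs that produce activations outside the observed range lose information after quantization.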

Model quantization with NNCF QAT

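Quantization-aware training (QAT) aims to mitigate accuracy problems by simulating the effects of quantization during training. If post-training quantization degrades accuracy, QAT can be used instead.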
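One common way to simulate quantization during training is "fake quantization" combined with a straight-through estimator: the forward pass uses a weight that has been quantized and immediately dequantized, while the gradient is applied to the underlying floating-point weight as if no rounding had happened. The sketch below fits a single weight this way in plain Python. It is a toy illustration under these assumptions, not the procedure used by NNCF or OVTrainer, and all names and data are made up.

```python
def fake_quantize(w, scale, qmin=-128, qmax=127):
    """Quantize then immediately dequantize, so the forward pass sees INT8 rounding."""
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def train_qat(data, scale=0.05, lr=0.05, epochs=200):
    """Fit y = w * x with a fake-quantized weight and a straight-through gradient."""
    w = 0.0  # latent floating-point weight, updated by gradient descent
    for _ in range(epochs):
        for x, y in data:
            w_q = fake_quantize(w, scale)   # forward pass uses the quantized weight
            grad = 2.0 * (w_q * x - y) * x  # gradient of the squared error w.r.t. w_q
            w -= lr * grad                  # straight-through: apply it to w directly
    return w, fake_quantize(w, scale)


# Made-up data generated from a target weight of 0.73.
data = [(x, 0.73 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w_fp, w_int8_sim = train_qat(data)
print("latent weight:", w_fp, "quantized weight:", w_int8_sim)
```

Because the loss is computed with the quantized weight, training settles on a grid value next to the target rather than on a floating-point value that would only be rounded afterwards.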

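Optimum Intel provides an "OVTrainer" class that replaces the Hugging Face Transformers "Trainer" class and, given an additional quantization configuration, uses NNCF to apply quantization during training. The following example fine-tunes DistilBERT on the Stanford Sentiment Treebank (SST) dataset and applies quantization-aware training (QAT).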

import numpy as np
import evaluate
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, default_data_collator
from optimum.intel.openvino import OVConfig, OVTrainer
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("glue", "sst2")
dataset = dataset.map(
    lambda examples: tokenizer(examples["sentence"], padding=True, truncation=True, max_length=128), batched=True
)
metric = evaluate.load("accuracy")
def compute_metrics(p):
    # Take the highest-scoring class as the prediction and compare it with the labels
    return metric.compute(
        predictions=np.argmax(p.predictions, axis=1), references=p.label_ids
    )
# The directory where the quantized model will be saved
save_dir = "nncf_qat_results"
# Load the default quantization configuration
ov_config = OVConfig()
trainer = OVTrainer(
    model=model,
    args=TrainingArguments(save_dir, num_train_epochs=1.0,
                           do_train=True, do_eval=True),
    train_dataset=dataset["train"].select(range(300)),
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    ov_config=ov_config,
    feature="sequence-classification",
)
train_result = trainer.train()
metrics = trainer.evaluate()
trainer.save_model()

Comparing FP32 and INT8 model outputs

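The "OVModelForXXX" classes provide the same API for loading FP32 and quantized INT8 OpenVINO™ models by setting "from_transformers=False". The following example loads the NNCF-optimized quantized INT8 models and runs inference with the OpenVINO™ Runtime.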

ov_ptq_model = OVModelForSequenceClassification.from_pretrained("nncf_ptq_results", from_transformers=False)
ov_ptq_pipe_cls = pipeline("text-classification", model=ov_ptq_model, tokenizer=tokenizer)
ov_ptq_outputs = ov_ptq_pipe_cls(text)
print("PTQ quantized INT8 model outputs: ", ov_ptq_outputs)
ov_qat_model = OVModelForSequenceClassification.from_pretrained("nncf_qat_results", from_transformers=False)
ov_qat_pipe_cls = pipeline("text-classification", model=ov_qat_model, tokenizer=tokenizer)
ov_qat_outputs = ov_qat_pipe_cls(text)
print("QAT quantized INT8 model outputs: ", ov_qat_outputs)

[Figure: example sentiment classification outputs of the FP32 and INT8 models]

Mitigating accuracy issues caused by saturation

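On older CPU generations (based on the SSE, AVX-2, and AVX-512 instruction sets), 8-bit instructions are prone to saturation (overflow) of the intermediate buffer, particularly when computing the dot product, which is the core of convolution and matrix multiplication operations. This saturation can cause an accuracy drop when running inference of an 8-bit quantized model on these architectures. The problem does not occur on GPUs, or on CPUs with Intel® Deep Learning Boost (VNNI) technology and later generations.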

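If you observe a significant accuracy difference (>1%) after using the default NNCF quantization configuration, you can use the following sample code to check whether the deployment platform supports Intel® Deep Learning Boost (VNNI) technology and later generations.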

import cpuinfo
# Query the CPU once; get_cpu_info() is slow, so reuse the result
info = cpuinfo.get_cpu_info()
flags = info.get('flags', [])
brand_raw = info.get('brand_raw', 'unknown')
# VNNI or AMX INT8 support means the saturation issue does not apply
has_vnni = any("vnni" in flag or "amx_int8" in flag for flag in flags)
w = "with" if has_vnni else "without"
overflow_fix = 'disable' if has_vnni else 'enable'
print("Detected CPU platform {0} {1} support of Intel(R) Deep Learning Boost (VNNI) technology "
      "and further generations, overflow fix should be {2}d".format(brand_raw, w, overflow_fix))

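When the full range of the 8-bit data type is used in quantization, many models suffer from saturation on older CPU platforms. One way to mitigate this is to use only 7 bits to represent the weights of convolutional or fully connected layers.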
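A small simulation shows why 7-bit weights help. It assumes, as a simplification of the pre-VNNI instruction sequence, that adjacent pairs of unsigned 8-bit by signed 8-bit products are summed into a saturating signed 16-bit intermediate before wider accumulation; the article does not spell out this mechanism, and the function names and input values here are illustrative.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturate_int16(value):
    """Clip a value to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, value))


def dot_u8s8(activations, weights, saturating=True):
    """Dot product of unsigned 8-bit activations and signed 8-bit weights.

    Adjacent pairs of products are summed into a 16-bit intermediate, which
    clips when `saturating` is True; pair sums are then accumulated exactly.
    """
    assert len(activations) == len(weights) and len(activations) % 2 == 0
    total = 0
    for i in range(0, len(activations), 2):
        pair = activations[i] * weights[i] + activations[i + 1] * weights[i + 1]
        total += saturate_int16(pair) if saturating else pair
    return total


acts = [255, 255, 255, 255]  # worst-case unsigned 8-bit activations
w8 = [127, 127, 127, 127]    # weights using the full signed 8-bit range
w7 = [63, 63, 63, 63]        # weights restricted to 7 bits

print("8-bit weights, exact:     ", dot_u8s8(acts, w8, saturating=False))
print("8-bit weights, saturating:", dot_u8s8(acts, w8))
print("7-bit weights, saturating:", dot_u8s8(acts, w7))
```

With 8-bit weights each pair sum can reach 2 * 255 * 127 = 64770, which exceeds 32767 and is clipped. With 7-bit weights the worst case is 2 * 255 * 63 = 32130, which fits, so the saturating and exact results agree.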

NNCF provides three options for handling saturation, selected with the "overflow_fix" parameter in the NNCF quantization configuration:

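"disable": (default) the saturation fix is not applied at all
"enable": the fix is applied to all layers in the model
"first_layer_only": the fix is applied to the first layer only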

The following example enables the overflow fix in the quantization configuration to mitigate accuracy issues on older CPU platforms.

import copy
from optimum.intel.openvino import OVConfig
from optimum.intel.openvino.configuration import DEFAULT_QUANTIZATION_CONFIG
# Copy the default configuration so the module-level dictionary is not modified
ov_config_dict = copy.deepcopy(DEFAULT_QUANTIZATION_CONFIG)
ov_config_dict["overflow_fix"] = "enable"
ov_config = OVConfig(compression=ov_config_dict)

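After quantizing the model with NNCF PTQ or NNCF QAT using the updated quantization configuration, you can repeat the FP32 and INT8 output comparison step above to verify that the inference results of the quantized INT8 model are consistent with the output of the FP32 model.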
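The consistency check can be automated with a small helper. This sketch assumes the usual text-classification pipeline output shape, a list of dictionaries with "label" and "score" keys; the helper name, the tolerance, and the example values are placeholders chosen for illustration and are not measured results.

```python
def outputs_agree(reference, candidate, score_tol=0.01):
    """Compare two text-classification outputs (lists of {'label', 'score'} dicts)."""
    if len(reference) != len(candidate):
        return False
    for ref, cand in zip(reference, candidate):
        if ref["label"] != cand["label"]:
            return False
        if abs(ref["score"] - cand["score"]) > score_tol:
            return False
    return True


# Placeholder values for illustration only, not measured results.
example_fp32 = [{"label": "NEGATIVE", "score": 0.90}]
example_int8 = [{"label": "NEGATIVE", "score": 0.895}]
print(outputs_agree(example_fp32, example_int8))
```

In the workflow above, this would be called as outputs_agree(fp32_outputs, ov_ptq_outputs) or outputs_agree(fp32_outputs, ov_qat_outputs), with score_tol adjusted to the accuracy budget of the application.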

Source: https://medium.com/openvino-toolkit/accelerate-inference-of-hugging-face-transformer-models-with-optimum-intel-and-openvino-ef1d64ee230e
