A step-by-step approach to building a custom 8-bit quantizer from scratch with PyTorch and quantizing facebook/opt-350m.
In this post, we'll build a custom 8-bit quantizer from scratch using our favorite library, PyTorch. We'll call it MYQ 8-bit (My Quantizer) to make it sound a bit more interesting. Below is our step-by-step plan of attack for reaching that goal.
Step 1: Build the MYQ 8-bit quantizer class
How does a quantization algorithm work at a granular level inside a deep neural network? Let's look at the diagram below to deepen our understanding.
Looking at the diagram above, the whole process can be summed up in one line: "replace the linear layers in the base model with quantized layers in the quantized model." Doing this is straightforward. Let me first explain the whole process in a few key points, and then we'll write the code together.
Now we're ready to write the code for the QuantizedLinearLayer. I've added a comment on each line to explain what it does, which I hope helps you follow the code more easily.
# Make sure to install these two libraries
# !pip install transformers
# !pip install -U "huggingface_hub[cli]" #For hugging face authentication
# First of all, import all the necessary libraries.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
# Since we're fetching the base model facebook/opt-350m from Hugging Face,
# we must authenticate first. Please create and use your own Hugging Face access token.
!huggingface-cli login --token hf_THkbLhyIHHmluGkwwnzpXOvR##########
# Define QuantizedLinearLayer class
class QuantizedLinearLayer(nn.Module):
    # Since our goal is to replace the linear layers of the base model, we must accept the same
    # constructor parameters: in_features, out_features, bias=True, dtype=torch.float32.
    # Here dtype is the data type used for the scale and bias buffers.
    def __init__(self, in_features, out_features, bias=True, dtype=torch.float32):
        super().__init__()
        # Note that we're using self.register_buffer to store these values instead of nn.Parameter.
        # With nn.Parameter the network would start tracking gradients, which we don't want
        # since we're not using these tensors for training.
        # weight is initialized randomly within [-128, 127], the range of signed int8
        # (randint's upper bound is exclusive, hence 128).
        self.register_buffer("weight", torch.randint(-128, 128, (out_features, in_features)).to(torch.int8))
        # scale holds one value per output feature (per channel) with the same dtype as the output,
        # because it is multiplied with the output of the linear layer.
        self.register_buffer("scale", torch.randn((out_features), dtype=dtype))
        # bias is an optional parameter, so we only add it if it is not None.
        # Its shape is (1, out_features) so that it can later be broadcast during the addition.
        if bias:
            self.register_buffer("bias", torch.randn((1, out_features), dtype=dtype))
        else:
            self.bias = None

    # 8-bit quantization function
    def quantize(self, weight):
        # Clone the weight and upcast it to fp32, which is necessary to calculate the scale accurately.
        weight_f32 = weight.clone().to(torch.float32)
        # The min and max of the int8 quantized range: qmin = -128, qmax = 127.
        Qmin = torch.iinfo(torch.int8).min
        Qmax = torch.iinfo(torch.int8).max
        # Calculating the per-channel scale.
        # In per-channel scaling, a scale is computed for every row, so the scales are stored in a tensor.
        # In per-tensor scaling, a single scale is computed for the entire tensor. Per-channel is more
        # accurate but has a slightly larger memory footprint since it stores more scale values.
        # weight_f32.abs().max(dim=-1).values -> the per-row maximum absolute value of the original fp32 weights.
        scale = weight_f32.abs().max(dim=-1).values / Qmax
        scale = scale.to(weight.dtype)
        # This gives the quantized weight values for the given weight tensor.
        # The formula comes from symmetric quantization; please read the link I've shared above
        # if you want to learn about it in detail.
        quantized_weight = torch.clamp(torch.round(weight / scale.unsqueeze(1)), Qmin, Qmax).to(torch.int8)
        self.weight = quantized_weight
        self.scale = scale

    def forward(self, input):
        # This produces the output the same way as the linear layer in the base model.
        # The only difference is that the stored weight is now the int8 quantized weight,
        # which is dequantized on the fly, so the layer uses far less memory.
        output = F.linear(input, self.weight.to(input.dtype)) * self.scale
        if self.bias is not None:
            output = output + self.bias
        return output
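Before wiring this class into a real model, here is an optional sanity check, just a small illustrative sketch of my own (the layer sizes and the random input are arbitrary): we quantize a standalone nn.Linear's weights with the class above and compare its output against the original layer.
# Optional sanity check (illustrative sketch): quantize a standalone linear layer and compare outputs.
original_layer = nn.Linear(16, 8, bias=True, dtype=torch.float32)
quantized_layer = QuantizedLinearLayer(16, 8, bias=True, dtype=torch.float32)
with torch.no_grad():
    quantized_layer.quantize(original_layer.weight)                   # int8 weights + per-channel scales
    quantized_layer.bias = original_layer.bias.clone().unsqueeze(0)   # copy the bias as shape (1, out_features)
    x = torch.randn(4, 16)
    error = (original_layer(x) - quantized_layer(x)).abs().max()
print(f"max absolute error after 8-bit quantization: {error.item():.6f}")  # should be small but non-zero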
Now that we've defined the QuantizedLinearLayer class, let's create a function that replaces the base model's Linear layers with our QuantizedLinearLayer class. Let's get started.
def replace_linearlayer(base_model, quantizer_class, exclude_list, quantized=True):
    # Iterate over the direct children of base_model and pick out the nn.Linear layers,
    # making sure to skip any linear layer whose name is in the exclude list.
    for name, child in base_model.named_children():
        if isinstance(child, nn.Linear) and name not in exclude_list:
            old_bias = child.bias
            old_weight = child.weight
            in_features = child.in_features
            out_features = child.out_features

            # This is the stage where we initialize the quantizer class with in_features, out_features,
            # bias and dtype. The base model layer's parameter values are passed to the quantizer class.
            quantizer_layer = quantizer_class(in_features, out_features, old_bias is not None, old_weight.dtype)

            # After the quantizer class is initialized, the replacement takes place as below.
            setattr(base_model, name, quantizer_layer)

            # After the replacement, the base model's linear layer is now a quantized layer.
            # We can now call its quantize function to turn the old floating-point (bfloat16) weights
            # into new int8 quantized weights.
            if quantized:
                getattr(base_model, name).quantize(old_weight)

            # If the bias is not None, we also copy over the base model's bias value.
            if old_bias is not None:
                getattr(base_model, name).bias = old_bias
        else:
            # If the child has further sub-components with linear layers, we quantize them by calling
            # replace_linearlayer recursively with the child acting as the base_model.
            # This replaces all the linear layers with quantized layers inside that child sub-module.
            replace_linearlayer(child, quantizer_class, exclude_list, quantized=quantized)
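To see the recursion and the exclude list in action before touching the real model, here is a small illustrative sketch with a made-up nested module (the module and the names block/head are hypothetical, chosen just for this demo):
# Illustrative sketch: a tiny nested module to exercise replace_linearlayer.
class TinyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 8)
        self.fc2 = nn.Linear(8, 4)
    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = TinyBlock()      # nested sub-module -> handled by the recursive call
        self.head = nn.Linear(4, 2)   # top-level layer we keep in full precision via the exclude list
    def forward(self, x):
        return self.head(self.block(x))

tiny = TinyModel()
replace_linearlayer(tiny, QuantizedLinearLayer, ["head"], quantized=True)
print(tiny)   # block.fc1 and block.fc2 become QuantizedLinearLayer, head stays nn.Linear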
Step 2: Now that we've built the quantizer, let's fetch the base model (facebook/opt-350m) from Hugging Face. Make sure you have a Hugging Face account and use your own access token. The process is simple and free.
# Take note that we'll be downloading the model in bfloat16 instead of its original fp32 datatype.
# This will reduce the size of the base model and our quantization time later.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.bfloat16)
print("facebook/opt-350m: base model architecture before quantization")
print("-"*50)
print(model)
In the base-model architecture above, all the linear layers inside the blue dashed box will be replaced with quantized layers, while the layer inside the green dashed box will be excluded. An LLM consists of many interconnected transformer layers, where each layer's output feeds the next layer's input. Quantizing the outermost layer could therefore hurt the model's overall accuracy, so it is better to leave it out.
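If you'd like to verify which layer to exclude for yourself, a quick way (my own optional check, not part of the original steps) is to list every nn.Linear in the model by name; for facebook/opt-350m the final output projection shows up as lm_head, which is exactly the name we will pass in the exclude list.
# Optional check: list the names of all linear layers so we know what to exclude.
linear_names = [name for name, module in model.named_modules() if isinstance(module, nn.Linear)]
print(f"number of linear layers: {len(linear_names)}")
print("last few:", linear_names[-3:])   # the last entry should be 'lm_head', the output projection we exclude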
Let's check the size of the base model before quantization. It comes out to 0.66 GB (662 MB).
# check the size of this base model before quantization
model_memory_size_before_quantization = model.get_memory_footprint()
print(f"Total memory size before quantization (in GB): {model_memory_size_before_quantization / 1e+9}")
Let's run inference on this facebook/opt-350m base model.
# Let's perform inference on this facebook/opt-350m base model
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
pipe("Malaysia is a beautiful country and ", max_new_tokens=50)
Step 3: Now we run the replace_linearlayer function. It performs two important tasks: replacing the base model's linear layers with our quantized layers, and quantizing their weights.
Let's write the code that runs the replace_linearlayer function.
# model: base_model, QuantizedLinearLayer: the quantized layer class we created in step 1, ["lm_head"]: the exclude list
# quantized=True: if we set quantized to False, the quantizer will only replace the linear layers with quantized layers but won't quantize the weights.
# We'll need this if we want to save the quantized model to, say, Hugging Face or another cloud provider.
# Later, any user can download the quantized model and build this skeleton of the base model in order to load it.
replace_linearlayer(model, QuantizedLinearLayer, ["lm_head"], quantized=True)
print("facebook/opt-350m: quantized model architecture")
print("-"*50)
print(model)
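As hinted in the comments above, quantized=False is meant for the case where the quantized weights are saved and later reloaded elsewhere. The original steps stop here, but a rough sketch of that workflow with plain torch.save / torch.load might look like this (the file name is just a placeholder, and you may need to adapt it to wherever you actually host the checkpoint):
# Rough sketch (not part of the original walkthrough): persist the quantized weights,
# then rebuild the quantized structure elsewhere and load the saved weights into it.
torch.save(model.state_dict(), "opt-350m-myq8bit.pt")   # placeholder file name

# ... later, possibly on another machine ...
skeleton = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.bfloat16)
replace_linearlayer(skeleton, QuantizedLinearLayer, ["lm_head"], quantized=False)  # structure only, no quantization
skeleton.load_state_dict(torch.load("opt-350m-myq8bit.pt"))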
In the quantized model architecture above, you can see that all the linear layers in the blue box have been replaced with quantized linear layers, while the linear layer in the green dashed box remains unchanged.
Now let's check the size of this quantized model. It is 0.35 GB (359 MB), roughly 54% of the base model's size, i.e. about a 46% reduction. That's a great result. (The saving is a little under 2x because the base model was loaded in bfloat16 and the embedding and lm_head layers are left unquantized.)
model_memory_size_after_quantization = model.get_memory_footprint()
print(f"Total memory size after quantization (in GB): {model_memory_size_after_quantization / 1e+9}")
Next, let's run inference on this quantized model.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
pipe("Malaysia is a beautiful country and ", max_new_tokens=50)
Remarkably, inference with the quantized model gives the same accuracy. That is quite impressive.
Step 4: MYQ 4-bit quantizer
To build a 4-bit quantizer, you'll need to implement one new technique on top of everything we've done so far: weight packing and weight unpacking. PyTorch currently has no native 4-bit, 2-bit, or any other sub-int8 data type, so we must use weight packing to reach our goal.
Weight packing: If we simply store each 4-bit encoded weight value in an int8 data type, its memory footprint is the same as that of a true 8-bit encoded tensor. We therefore need a way for each 4-bit encoded value to occupy only 4 bits of memory, so that the footprint of the whole 4-bit quantized model is still close to half that of the 8-bit quantized model. Weight packing achieves this: we fit multiple 4-bit encoded values into a single 8-bit element, so that each 4-bit value takes only 4 bits of space and the remaining bits are used by another 4-bit value.
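To make the idea concrete, here is a minimal sketch of weight packing, written by me for illustration rather than taken from the MYQ 4-bit implementation. It assumes the values are already unsigned 4-bit codes in [0, 15] and that there is an even number of them, and it packs two codes into each uint8 byte with a shift and a bitwise OR.
def pack_4bit(values):
    # values: a 1-D uint8 tensor whose entries are already 4-bit codes in [0, 15] (even count assumed).
    assert values.dtype == torch.uint8 and values.numel() % 2 == 0
    low = values[0::2]          # every even-indexed code goes into the lower 4 bits
    high = values[1::2]         # every odd-indexed code goes into the upper 4 bits
    return (high << 4) | low    # two 4-bit codes stored per uint8 element

codes = torch.tensor([3, 10, 15, 0, 7, 12], dtype=torch.uint8)
packed = pack_4bit(codes)
print(packed)   # 3 bytes instead of 6: tensor([163,  15, 199], dtype=torch.uint8)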
Weight unpacking: During inference, the model only works with floating-point values. So we use weight unpacking to split each packed weight tensor back into its individual 4-bit encoded values, each assigned to its own int8 element (so each 4-bit value gets a full 8 bits of memory again), and then convert them to fp32 for inference.
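And the matching unpacking sketch (again my own illustration): it pulls the lower and upper 4 bits back out of each byte into separate uint8 values, which can then be cast to a float dtype and rescaled for inference.
def unpack_4bit(packed):
    # Inverse of pack_4bit: recover the two 4-bit codes stored in each byte.
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    return torch.stack([low, high], dim=1).flatten()   # restore the original ordering

restored = unpack_4bit(packed)
print(restored)                     # matches the original `codes` tensor
print(restored.to(torch.float32))   # each code now occupies a full float, ready to be rescaled (dequantized)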
Summary