
A Coding Implementation on Introduction to Weight Quantization: A Key Aspect of Enhancing Efficiency in Deep Learning and LLMs


In today's deep learning landscape, optimizing models for deployment in resource-constrained environments is more important than ever. Weight quantization addresses this need by reducing the precision of model parameters, typically from 32-bit floating-point values to lower bit-width representations, yielding smaller models that can run faster on hardware with limited resources. This tutorial introduces weight quantization using PyTorch's dynamic quantization technique on a pretrained ResNet18 model. It explores how to inspect weight distributions, apply dynamic quantization to key layers (such as fully connected layers), compare model sizes, and visualize the resulting changes, equipping you with both the theoretical background and the practical skills needed to deploy quantized deep learning models.
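
Before diving into the full walkthrough, the following minimal sketch (not part of the tutorial's own code) makes the core idea concrete: an FP32 tensor is mapped onto 8-bit integers via a scale factor and then mapped back, at the cost of a small rounding error.

import torch

w = torch.randn(4, 4)                                  # stand-in for FP32 weights
scale = float(w.abs().max()) / 127.0                   # symmetric scale for int8
w_q = torch.quantize_per_tensor(w, scale=scale, zero_point=0, dtype=torch.qint8)

print(w_q.int_repr())                                  # the stored 8-bit integers
print((w_q.dequantize() - w).abs().max())              # rounding error introduced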

import torch
import torch.nn as nn
import torch.quantization
import torchvision.models as models
import matplotlib.pyplot as plt
import numpy as np
import os


print("Torch model:", torch.__version__)

We import the required libraries, such as PyTorch, torchvision, and matplotlib, and print the PyTorch version, ensuring all necessary modules are ready for model manipulation and visualization.

model_fp32 = models.resnet18(pretrained=True)
model_fp32.eval()  


print("Pretrained ResNet18 (FP32) mannequin loaded.")

A pretrained ResNet18 model is loaded in FP32 (floating-point) precision and set to evaluation mode, preparing it for further processing and quantization.
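
Note that the pretrained=True flag is deprecated in recent torchvision releases. If you are on torchvision 0.13 or newer, an equivalent call (a drop-in alternative, not the tutorial's original snippet) looks like this:

# On torchvision >= 0.13 the weights enum replaces the deprecated `pretrained` flag.
from torchvision.models import resnet18, ResNet18_Weights

model_fp32 = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
model_fp32.eval()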

fc_weights_fp32 = model_fp32.fc.weight.data.cpu().numpy().flatten()


plt.figure(figsize=(8, 4))
plt.hist(fc_weights_fp32, bins=50, color="skyblue", edgecolor="black")
plt.title("FP32 - FC Layer Weight Distribution")
plt.xlabel("Weight values")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()

In this block, the weights from the final fully connected layer of the FP32 model are extracted and flattened, and a histogram is plotted to visualize their distribution before any quantization is applied.
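
If you prefer numbers alongside the plot, a quick summary of the same array (a small addition, reusing fc_weights_fp32 from above) can be printed as well:

# Summary statistics of the FP32 fully connected weights shown in the histogram.
print(f"min={fc_weights_fp32.min():.4f}, max={fc_weights_fp32.max():.4f}, "
      f"mean={fc_weights_fp32.mean():.4f}, std={fc_weights_fp32.std():.4f}")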

quantized_model = torch.quantization.quantize_dynamic(model_fp32, {nn.Linear}, dtype=torch.qint8)
quantized_model.eval()  


print("Dynamic quantization utilized to the mannequin.")

We apply dynamic quantization to the model, specifically targeting its Linear layers and converting them to a lower-precision format, demonstrating a key technique for reducing model size and inference latency.
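
A quick sanity check (not part of the original snippet) is to print the replaced module: the final fully connected layer should now be a dynamically quantized Linear, while the convolutional layers remain in FP32.

# Expected to print a DynamicQuantizedLinear module with dtype=torch.qint8,
# confirming that only the Linear layers were converted.
print(quantized_model.fc)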

def get_model_size(model, filename="temp.p"):
    torch.save(model.state_dict(), filename)
    size = os.path.getsize(filename) / 1e6
    os.remove(filename)
    return size


fp32_size = get_model_size(model_fp32, "fp32_model.p")
quant_size = get_model_size(quantized_model, "quant_model.p")


print(f"FP32 Mannequin Measurement: {fp32_size:.2f} MB")
print(f"Quantized Mannequin Measurement: {quant_size:.2f} MB")

A helper function is defined to save the model's state dict and check its size on disk; it is then used to measure and compare the sizes of the original FP32 model and the quantized model, showcasing the compression impact of quantization.
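
To make the comparison explicit, the two measurements can be combined into a compression summary (a small follow-up using the fp32_size and quant_size values from above). Keep in mind that for ResNet18 only the final Linear layer is quantized, so the overall saving is modest.

# Absolute and relative size reduction from dynamic quantization.
reduction_mb = fp32_size - quant_size
print(f"Size reduction: {reduction_mb:.2f} MB "
      f"({100 * reduction_mb / fp32_size:.1f}% smaller, "
      f"{fp32_size / quant_size:.2f}x compression)")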

dummy_input = torch.randn(1, 3, 224, 224)


with torch.no_grad():
    output_fp32 = model_fp32(dummy_input)
    output_quant = quantized_model(dummy_input)


print("Output from FP32 mannequin (first 5 components):", output_fp32[0][:5])
print("Output from Quantized mannequin (first 5 components):", output_quant[0][:5])

A dummy input tensor is created to simulate an image, and both the FP32 and quantized models are run on it so their outputs can be compared, validating that quantization does not drastically alter the predictions.
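
To go beyond eyeballing the first five logits, the agreement between the two outputs can be quantified (an optional addition, reusing output_fp32 and output_quant from above):

# Maximum absolute difference between the logits, and whether the top-1 class matches.
max_abs_diff = (output_fp32 - output_quant).abs().max().item()
same_top1 = output_fp32.argmax(dim=1).equal(output_quant.argmax(dim=1))
print(f"Max absolute logit difference: {max_abs_diff:.4f}")
print(f"Top-1 prediction unchanged: {same_top1}")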

if hasattr(quantized_model.fc, 'weight'):
    fc_weights_quant = quantized_model.fc.weight().dequantize().cpu().numpy().flatten()
else:
    fc_weights_quant = quantized_model.fc._packed_params._packed_weight.dequantize().cpu().numpy().flatten()


plt.figure(figsize=(14, 5))


plt.subplot(1, 2, 1)
plt.hist(fc_weights_fp32, bins=50, color="skyblue", edgecolor="black")
plt.title("FP32 - FC Layer Weight Distribution")
plt.xlabel("Weight values")
plt.ylabel("Frequency")
plt.grid(True)


plt.subplot(1, 2, 2)
plt.hist(fc_weights_quant, bins=50, color="salmon", edgecolor="black")
plt.title("Quantized - FC Layer Weight Distribution")
plt.xlabel("Weight values")
plt.ylabel("Frequency")
plt.grid(True)


plt.tight_layout()
plt.show()

In this block, the quantized weights (after dequantization) are extracted from the fully connected layer and compared via side-by-side histograms against the original FP32 weights, illustrating the changes in the weight distribution caused by quantization.
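
Since fc_weights_quant holds the dequantized values, the per-weight rounding error introduced by int8 quantization can also be summarized directly (a small extension to the visual comparison):

# Element-wise error between the original FP32 weights and their int8-quantized,
# then dequantized, counterparts.
err = fc_weights_fp32 - fc_weights_quant
print(f"Mean |error|: {np.abs(err).mean():.6f}")
print(f"Max  |error|: {np.abs(err).max():.6f}")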


In conclusion, this tutorial has provided a step-by-step guide to understanding and implementing weight quantization, highlighting its impact on model size and performance. By quantizing a pretrained ResNet18 model, we observed the shifts in weight distributions, the tangible benefits in model compression, and potential inference speed improvements. This exploration sets the stage for further experimentation, such as Quantization Aware Training (QAT), which can further optimize the performance of quantized models.
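
For readers who want a starting point for that next step, PyTorch's eager-mode QAT follows a prepare-train-convert pattern. The sketch below is only a minimal illustration on a toy model (not ResNet18, which would additionally require QuantStub/DeQuantStub placement and module fusion), under the assumption of the eager-mode quantization API:

import torch
import torch.nn as nn

# Toy model used only to illustrate the QAT workflow.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))
model.train()

# Attach a QAT config and insert fake-quantization observers.
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
torch.quantization.prepare_qat(model, inplace=True)

# ... a normal training loop would run here, letting the fake-quant ops calibrate ...

# Convert to a real int8 model once training is done.
model.eval()
quantized = torch.quantization.convert(model)
print(quantized)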


Here is the Colab Notebook.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.
