
I simply downloaded the most recent 4-billion-parameter model and hit ‘Run’. After a while, the Google Colab instance crashes. Sounds familiar? Well, that is bound to happen if we don’t pay attention to the VRAM the model requires versus the VRAM we are providing it. Quantization is something that can help you tackle this problem, and that is exactly what we will be covering in this blog; we will also learn how to calculate the VRAM requirements of a model, learn about a few quantization techniques, and look at alternatives for handling these really large language models.
Parameters vs. Model Size
The parameter count is crucial for measuring a model’s footprint, but we should not forget about the precision of the model’s weights (Note: the weights of a model are its parameters). A simple way to estimate the model’s VRAM is {No. of Parameters x Precision (in Bytes)}.
Example: If we have a model with 300M parameters and the weights are stored in 32-bit precision, that gives (300 x 10^6) x (4 Bytes) = 1.2 GB. Adding some overhead for activations and buffers, this model will need roughly 1.5 GB of VRAM.
Note: 1 Byte = 8 Bits
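Here is a minimal sketch of this estimate in Python (the function name and the precision table are only for illustration, not from any library):

bytes_per_param = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}

def estimate_vram_gb(num_params, precision="fp32"):
    # No. of Parameters x Precision (in Bytes), converted to GB
    return num_params * bytes_per_param[precision] / 1e9

print(estimate_vram_gb(300e6, "fp32"))   # ~1.2 GB for the 300M-parameter example
print(estimate_vram_gb(4e9, "int4"))     # ~2 GB for a 4B model in 4-bit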
What is Model Quantization?
Quantization reduces the precision of a model’s weights while aiming to keep performance roughly the same. This can typically shrink the model size by 2x or more. The model’s performance is of course affected, but not by much if we perform the right quantization and test the results.
Example: mapping high-precision numbers (like 32-bit floats) to lower-precision buckets (like 4-bit integers).
- Half-Precision (BF16/FP16): Reduces memory by 50% with almost zero loss in accuracy.
- Deep Quantization (INT8/INT4): Reductions of 75% or more. This is how we can fit large models onto consumer hardware.
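To build intuition, here is a toy absmax INT8 quantization of a random weight tensor; this is only a sketch of the idea, not the exact scheme used by production libraries like bitsandbytes:

import torch

weights = torch.randn(4, 4)                               # pretend fp32 weights (4 bytes each)
scale = weights.abs().max() / 127                         # absmax scaling factor
q_weights = torch.round(weights / scale).to(torch.int8)   # stored as int8 (1 byte each)
deq_weights = q_weights.float() * scale                   # approximate reconstruction at compute time

print("max reconstruction error:", (weights - deq_weights).abs().max().item())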

Model Quantization in Action
In this section, we aim to perform quantization with the help of PyTorch using a Google Colab instance. We will run inference with Gemma 3 (4B) by quantizing and loading the model through Hugging Face transformers.
Note: In full 32-bit precision, this roughly 4-billion-parameter model needs about 4 x 10^9 x 4 Bytes ≈ 16 GB of VRAM (the exact footprint turns out to be about 17.2 GB), more than a free Colab T4 offers.
Pre-Requisites
- We will take help from Hugging Face and Google Colab for this demo. We will be using a Gemma-3 model, which is a gated model. Make sure to get the permission from here after logging in.

- Create a Hugging Face token, which we will use later, from here.

Note: Make sure to check the ‘Read access to contents of all public gated repos you can access’ option.
- Google Colab instance with T4 GPU:
Make sure to change the runtime type to T4 GPU in a new Colab notebook. A quick sanity check for the GPU is shown below.
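Before loading anything, you can confirm that the T4 is attached (a simple check, not part of the original walkthrough):

import torch

print(torch.cuda.is_available())        # should be True on a GPU runtime
print(torch.cuda.get_device_name(0))    # should report a Tesla T4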

Quantizing Models to bfloat16
Installations
!pip install -U transformers accelerate
Enter your Hugging Face Token
!hf auth login
Paste the Hugging Face key when prompted.

Note: You can type ‘n’ for Add token as git credential.
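If you prefer not to use the CLI, you can also log in programmatically from the notebook (the token below is a placeholder; never commit a real token):

from huggingface_hub import login

login(token="hf_xxxxxxxxxxxxxxxx")   # placeholder token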
Imports
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from PIL import Image
import requests
import torch
Loading the model
import torch

desired_dtype = torch.bfloat16
torch.set_default_dtype(desired_dtype)

model_id = "google/gemma-3-4b-it"

model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto"
).eval()
Note: setting the default dtype here loads the weights in bfloat16 instead of the default float32, halving the memory footprint.
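An equivalent and more explicit option (assuming a reasonably recent transformers version) is to pass the dtype directly to from_pretrained instead of changing the global default:

model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # load the weights directly in bfloat16
    device_map="auto"
).eval()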
Looking at the model info
- Parameters and their dtypes in the model:
for name, param in model.named_parameters():
    print(f"{name}: {param.dtype}")
    break
Output:
model.vision_tower.vision_model.embeddings.patch_embedding.weight: torch.bfloat16
Note: You can remove the break to see all the layers. Also, you can see that our parameters are now in ‘bfloat16’.
- Model footprint:
print("Footprint of the bf16 model in GBs: ", model.get_memory_footprint()/1e+9)
Output:
Footprint of the bf16 model in GBs: 8.600192738
Note: The footprint would be 17.200351684 GB if we don’t quantize the model; that would likely not run on the Colab instance we created.
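The 8.6 GB figure matches the earlier back-of-the-envelope formula, since bfloat16 stores 2 bytes per parameter; a quick check on the loaded model:

num_params = sum(p.numel() for p in model.parameters())
print(num_params)               # ~4.3e9 parameters
print(num_params * 2 / 1e9)     # bf16: 2 bytes/param -> ~8.6 GB
print(num_params * 4 / 1e9)     # fp32: 4 bytes/param -> ~17.2 GB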
Running Inference
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "Explain how a transformer works."}]
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False
    )
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
Output:
Okay, here’s a quick explanation of how a transformer works:
A transformer uses electromagnetic induction to change voltage levels.
Would you like me to delve into a specific aspect, like how
Great! We successfully ran inference on the quantized model and got good results. Now let’s try to quantize the model even further.
Quantizing Models even further
Installations
!pip install -U bitsandbytes
Note: Install this along with the earlier installations if you have started a new instance.
Imports
from transformers import AutoProcessor, Gemma3ForConditionalGeneration, BitsAndBytesConfig
from PIL import Image
import requests
import torch
Loading the model
model_id = "google/gemma-3-4b-it"

# Optimized 4-bit configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True   # Also quantizes the quantization constants
)

model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16   # Important for Gemma stability
)
Note: nf4 (NormalFloat4) is a data type designed for highly efficient low-bit quantization. We are also configuring the computations under the hood to run in ‘bfloat16’ for performance.
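You can confirm which settings transformers attached to the loaded model (the quantization_config attribute on the config is the standard place for bitsandbytes-quantized models):

print(model.config.quantization_config)   # should echo the 4-bit NF4 settings above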
Parameters and Size of the model
for name, param in model.named_parameters():
    print(f'{name}: {param.dtype}')
Output:
model.vision_tower.vision_model.encoder.layers.2.layer_norm1.weight: torch.bfloat16
model.vision_tower.vision_model.encoder.layers.2.layer_norm1.bias: torch.bfloat16
model.vision_tower.vision_model.encoder.layers.2.self_attn.k_proj.weight: torch.uint8
model.vision_tower.vision_model.encoder.layers.2.self_attn.k_proj.bias: torch.bfloat16
model.vision_tower.vision_model.encoder.layers.2.self_attn.v_proj.weight: torch.uint8
model.vision_tower.vision_model.encoder.layers.2.self_attn.v_proj.bias: torch.bfloat16
model.vision_tower.vision_model.encoder.layers.2.self_attn.q_proj.weight: torch.uint8
model.vision_tower.vision_model.encoder.layers.2.self_attn.q_proj.bias: torch.bfloat16
model.vision_tower.vision_model.encoder.layers.2.self_attn.out_proj.weight: torch.uint8
model.vision_tower.vision_model.encoder.layers.2.self_attn.out_proj.bias: torch.bfloat16
model.vision_tower.vision_model.encoder.layers.2.layer_norm2.weight: torch.bfloat16
model.vision_tower.vision_model.encoder.layers.2.layer_norm2.bias: torch.bfloat16
model.vision_tower.vision_model.encoder.layers.2.mlp.fc1.weight: torch.uint8
Notice something interesting? Not all layers are scaled down. This is because the bitsandbytes quantization in transformers quantizes the linear-layer weights to 4 bits and then packs two 4-bit weights into a single torch.uint8 container, while the other parameters (the layer norms and biases here) are kept in ‘bfloat16’. To see the split at a glance, you can count the parameter tensors per dtype, as in the sketch below.
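A small helper (not part of the original walkthrough) that tallies how many parameter tensors ended up in each dtype:

from collections import Counter

dtype_counts = Counter(param.dtype for _, param in model.named_parameters())
print(dtype_counts)   # e.g. Counter({torch.bfloat16: ..., torch.uint8: ...})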
print("Footprint of the mannequin in GBs: ",
mannequin.get_memory_footprint()/1e+9)
Output:
Footprint of the model in GBs: 3.170623202
Great! The size of the model has been drastically reduced.
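The number also makes sense from a rough, approximate split: the packed uint8 tensors cost 1 byte per element (two 4-bit weights each), while the parameters left in bfloat16, dominated by the embeddings, cost 2 bytes each:

uint8_bytes = sum(p.numel() for p in model.parameters() if p.dtype == torch.uint8)        # 1 byte each
bf16_bytes  = sum(p.numel() * 2 for p in model.parameters() if p.dtype == torch.bfloat16) # 2 bytes each

# The quantization constants add a little extra on top of this estimate
print((uint8_bytes + bf16_bytes) / 1e9)   # should land close to the ~3.17 GB footprint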
Running Inference
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "Explain how a transformer works in 60-80 words."}]
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False
    )
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
Output:
Okay, here’s a breakdown of how a Transformer works in roughly 60-80 words:
Transformers are neural networks that excel at processing sequential data like text.
Essentially, the model simultaneously considers all input words, understanding their...
Well, the context captured by the two quantized models varies, but we don’t see any noticeable hallucination in either response.
Alternatives to Standard Quantization
Here are some alternatives you can put to use instead of standard quantization:
- AWQ (Activation-aware Weight Quantization): This method protects the most important weights during compression. In PyTorch, we can load these using AutoAWQForCausalLM (from the awq library) alongside Hugging Face’s from_pretrained method to get good accuracy with 4-bit weights.
- GGUF (GPT-Generated Unified Format): Hugging Face transformers now supports GGUF natively. Using GGUF you can perform layer offloading, splitting the model between VRAM and system RAM to run huge models on limited hardware.
- QLoRA: Yes, I am suggesting fine-tuning your model. Instead of struggling to run a massive 8-billion-parameter model, fine-tuning a 3-billion-parameter model on your specific data can be the better option. A domain-specific model often outperforms a general one while using much less memory; a minimal setup sketch follows this list.
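As a taste of the QLoRA route, here is a sketch (using the separately installed peft library) that attaches small trainable adapters to the 4-bit model we already loaded; the target module names are assumptions and vary by architecture:

# !pip install -U peft
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

lora_config = LoraConfig(
    r=8,                                   # adapter rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # assumed attention projections; adjust per model
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = prepare_model_for_kbit_training(model)   # make the 4-bit base safe for training
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()          # only the adapter weights are trainable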
Also Read: Top 15+ Cloud GPU Providers for 2026
Conclusion
Next time you hit Run on a massive model, don’t let your Google Colab instance crash. By learning the relationship between parameter count and weight precision, you can roughly calculate the memory footprint required. Whether through bfloat16 or deep 4-bit quantization, shrinking model size is no longer a mystery. You now have the tools and ideas to handle large models with ease. Also, remember to test your models on standard datasets to evaluate their performance.
Frequently Asked Questions
Q. How can I check the hardware of my Colab instance?
A. View CPU details using !lscpu and GPU status via !nvidia-smi. Alternatively, click the RAM/Disk status bar (at the top right) to see your current hardware resource allocation and usage.
Q. How should I evaluate a quantized or fine-tuned LLM?
A. Evaluate LLMs using MMLU for knowledge, GSM8K for math, HumanEval for coding, and TruthfulQA. Use a domain-specific dataset if you are evaluating a fine-tuned model.
Q. What is QLoRA?
A. QLoRA is an efficient fine-tuning method that uses 4-bit quantization to reduce memory usage while maintaining performance by training small adapter layers on top.

