Whether the goal is to speed up inference, reduce power consumption, or enable AI models to run on resource-constrained hardware platforms, model compression techniques are the go-to methods for achieving these goals. Techniques like model distillation and pruning are frequently leveraged in these scenarios, but above all, developers rely on quantization. This process shrinks the size of a model's parameters, such as its weights and biases, to reduce the model's overall size and computational complexity.
But as the old saying goes, you can't get something for nothing. Reducing the precision of model weights is not magic. The technique may be guaranteed to save memory and processing time, but there are no guarantees that a quantized model will perform as well as the original. And if the model doesn't perform well enough to do its job, what value is there in shrinking it?
To avoid these sorts of issues, quantization needs to be approached very carefully. One option recently developed by NVIDIA is the NVFP4 data type. It is a 4-bit floating-point format that was introduced alongside the Blackwell GPU architecture. Despite being only four bits in size, the unique design of NVFP4 makes it possible to represent a wide dynamic range of tensor values, and to significantly reduce the size and processing requirements of models without substantial reductions in performance.
NVFP4 achieves this through a dual-scaling approach that addresses one of the biggest challenges in low-bit quantization: maintaining numerical accuracy across a wide range of values. Like many other similar techniques, NVFP4 uses a basic E2M1 structure (1 sign bit, 2 exponent bits, and 1 mantissa bit), but the real innovation is in how it scales values.
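To see what four bits actually buy, here is a small Python sketch (illustrative only, not NVIDIA's code) that enumerates every value an E2M1 float can represent, assuming the standard exponent bias of 1:

```python
def e2m1_value(sign: int, exp: int, man: int) -> float:
    """Decode one E2M1 code point: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    if exp == 0:
        magnitude = man * 0.5                         # subnormal: 0.0 or 0.5
    else:
        magnitude = (1 + 0.5 * man) * 2 ** (exp - 1)  # normal: implicit leading 1
    return -magnitude if sign else magnitude

values = sorted({e2m1_value(s, e, m)
                 for s in (0, 1) for e in range(4) for m in (0, 1)})
print(values)
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Only 15 distinct values exist, and the largest magnitude is just 6.0, which is exactly why the scaling scheme matters so much: the scale factors are what map real weight distributions onto this tiny grid.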
Each group of 16 values, called a micro-block, shares a dynamically calculated FP8 (E4M3) scaling factor. This fractional scaling allows NVFP4 to match the original distribution of values much more closely than earlier approaches, such as MXFP4, which used coarse power-of-two scaling over 32-value blocks. On top of this, a second, higher-precision FP32 scaling factor is applied at the tensor level to normalize the data further and reduce errors.
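The mechanics are easier to see in code. The NumPy sketch below is a simplification, not NVIDIA's implementation: values and scales stay in float32 rather than being packed into 4-bit codes and FP8, and the per-block scale is left unrounded. It round-trips a tensor through a two-level-scaled E2M1 grid so the quantization error can be inspected:

```python
import numpy as np

# E2M1 magnitudes: the eight non-negative values a 4-bit NVFP4 code can take.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
E2M1_MAX, E4M3_MAX = 6.0, 448.0

def nvfp4_style_roundtrip(x: np.ndarray, block: int = 16) -> np.ndarray:
    flat = x.astype(np.float32).ravel()
    pad = (-flat.size) % block                        # pad to whole micro-blocks
    blocks = np.pad(flat, (0, pad)).reshape(-1, block)
    amax = np.abs(blocks).max(axis=1, keepdims=True)

    # Level 2: one FP32 scale for the whole tensor, sized so that every
    # per-block scale fits in the representable range of FP8 (E4M3).
    tensor_scale = float(amax.max()) / (E2M1_MAX * E4M3_MAX) or 1.0

    # Level 1: one scale per 16-value micro-block (stored as E4M3 on the
    # GPU; left unrounded here for brevity).
    block_scale = np.where(amax > 0, amax / (E2M1_MAX * tensor_scale), 1.0)

    # Snap each scaled value to the nearest E2M1 magnitude, keeping its sign.
    scaled = blocks / (block_scale * tensor_scale)
    nearest = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    snapped = np.sign(scaled) * E2M1_GRID[nearest]

    restored = snapped * block_scale * tensor_scale
    return restored.ravel()[:flat.size].reshape(x.shape)

weights = np.random.randn(4, 32).astype(np.float32)
error = np.abs(weights - nvfp4_style_roundtrip(weights)).max()
print(f"max round-trip error: {error:.4f}")
```

Because each micro-block gets its own scale, a block of small values is not forced onto the same grid as a block containing an outlier, which is the core advantage over coarser schemes like MXFP4.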
This two-level scaling strategy means that NVFP4 can preserve more of the original model's intelligence, even when compressed to just four bits. In benchmarking tests on models like DeepSeek-R1-0528, the accuracy drop from FP8 to NVFP4 was less than 1% across a range of tasks, and in one case, NVFP4 even outperformed FP8.
In terms of memory efficiency, NVFP4 reduces model size by about 3.5x compared to FP16 and about 1.8x compared to FP8. These savings translate directly into improved performance and scalability. Furthermore, thanks to the architectural advances in Blackwell GPUs and their support for ultra-low-precision operations, NVFP4 can help achieve up to 50x better energy efficiency per token compared to the previous-generation Hopper GPUs.
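Those ratios fall straight out of the format's overhead. Each value costs 4 bits of payload plus its share of one 8-bit block scale per 16 values (the lone FP32 tensor scale is negligible for large tensors):

```python
# Effective storage cost per NVFP4 value: 4 bits of payload plus 8 bits of
# E4M3 block scale amortized across each 16-value micro-block.
bits_per_value = 4 + 8 / 16      # 4.5 effective bits
print(f"vs FP16: {16 / bits_per_value:.2f}x smaller")  # ~3.56x
print(f"vs FP8:  {8 / bits_per_value:.2f}x smaller")   # ~1.78x
```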
With tools like the NVIDIA TensorRT Model Optimizer and LLM Compressor supporting the format, and prequantized models like Llama 3 and DeepSeek-R1-0528 already available, developers can start benefiting from NVFP4 today. It's a major step forward in making AI faster, smaller, and greener, without sacrificing performance.
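As a rough illustration of that workflow, post-training quantization with TensorRT Model Optimizer follows a calibrate-then-quantize pattern along these lines. This is a sketch only: the NVFP4_DEFAULT_CFG config name, the model choice, and the calibration details are assumptions to verify against the current modelopt documentation.

```python
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model choice; any Hugging Face causal LM follows the same flow.
name = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

def forward_loop(m):
    # Run a few representative batches so the scaling factors can be
    # calibrated against real activation statistics.
    batch = tokenizer("The quick brown fox jumps over the lazy dog.",
                      return_tensors="pt")
    with torch.no_grad():
        m(**batch)

# NVFP4_DEFAULT_CFG is assumed from recent modelopt releases; the
# mtq.quantize(model, config, forward_loop) call is the library's
# documented post-training quantization entry point.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```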
The two-level scaling strategy of NVFP4 (📷: NVIDIA)
NVFP4 supports a wide dynamic range of tensor values (📷: NVIDIA)
Quantization errors are reduced by the new technique (📷: NVIDIA)