Those passionate about technology are always looking to explore, pushing the boundaries of what is possible. They are always looking to erase constraints and keep scaling the mountain. The higher you climb, i.e., the more constraints you erase, the more beautiful the possibilities of leveraging technology become. And it would be a crime not to keep exploring the options to uncover the full beauty of technology.
When it comes to Edge AI, the realm of unique possibilities is currently constrained on both the hardware and software side. But true to the human spirit, nerds worldwide are constantly pushing to erase those constraints. From enhancing edge device hardware capabilities to developing edge-optimized LLM applications, efforts are always underway. One approach to reducing model size, so that models can be effectively deployed on the edge, is quantization.
Before you read ahead, I suggest you watch Episode 23 of “Edge AI Bytes” for a high-level overview of quantization. This will help you make sense of the rest of this article.
What exactly is quantization?
So, let us get a bit more technical here. Quantization is a technique for executing computations and storing tensors at lower bit widths than floating-point precision. Essentially, a quantized model executes operations on tensors at lower precision instead of full floating-point precision. This makes the model representation more compact and allows the use of high-performance vectorized operations on a wide range of hardware platforms. If you want the details on performing quantization, this post can help you understand it from a PyTorch context.
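To make that concrete, here is a minimal sketch of quantizing a single tensor with PyTorch. The scale and zero point below are arbitrary placeholder values chosen purely for illustration, not values you would use in practice.

```python
import torch

# A minimal sketch of per-tensor quantization in PyTorch.
# The scale and zero point are arbitrary placeholder values.
x_fp32 = torch.randn(4)

# Store the values as 8-bit integers plus a scale and zero point.
x_int8 = torch.quantize_per_tensor(x_fp32, scale=0.05, zero_point=0, dtype=torch.qint8)

print(x_int8)               # the quantized tensor (compact representation)
print(x_int8.int_repr())    # the raw INT8 values actually stored
print(x_int8.dequantize())  # approximate reconstruction of the FP32 values
```

Note how the dequantized values only approximate the originals; that approximation error is exactly what the approaches below try to manage.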
Quantization approaches
Two quantization approaches are commonly leveraged for model quantization, and we will overview both:
- Quantization-Aware Training (QAT)
- Post-Training Quantization (PTQ)
Quantization-aware training (QAT) applies quantization during the training process. In the QAT approach, the model is trained with quantization-aware operations that emulate quantization at training time. This lets the model learn to perform well in the quantized representation, which translates into better accuracy compared to PTQ.
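The sketch below shows the general shape of QAT with PyTorch's eager-mode utilities. The tiny network, its layer sizes, and the "fbgemm" backend are illustrative assumptions, and the actual training loop is elided.

```python
import torch
import torch.nn as nn

# A tiny example model; the architecture and sizes are purely illustrative.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.fc1 = nn.Linear(32, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 10)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)            # enter the (fake-)quantized domain
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)       # return float values to the caller

model = TinyNet()
model.train()

# Attach a QAT configuration: fake-quantization modules emulate INT8 during training.
model.qconfig = torch.ao.quantization.get_default_qat_qconfig("fbgemm")
torch.ao.quantization.prepare_qat(model, inplace=True)

# ... run your usual training loop here so the model adapts to quantization noise ...

# After training, convert the fake-quantized model into a real INT8 model.
model.eval()
quantized_model = torch.ao.quantization.convert(model)
```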
Post-training quantization (PTQ) applies quantization to a model after it has already been trained. In this approach, the trained model's weights and activations are quantized from high precision to low precision (for example, from FP32 to INT8). Though this approach is simpler to implement, accuracy suffers because it does not account for the effects of quantization during training.
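As a rough sketch of how simple PTQ can be, here is PyTorch's dynamic quantization applied to an already-trained float model. The model below is just a placeholder to keep the example self-contained.

```python
import torch
import torch.nn as nn

# Stand-in for an already-trained float model; the architecture is a placeholder.
float_model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
float_model.eval()

# Post-training dynamic quantization: weights become INT8 after training,
# activations are quantized on the fly at inference time. No retraining needed.
quantized_model = torch.ao.quantization.quantize_dynamic(
    float_model,
    {nn.Linear},        # layer types to quantize
    dtype=torch.qint8,  # target weight precision
)

print(quantized_model)
```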
Quantization types
We have overviewed the quantization approaches so far. Another decision to make during quantization is which quantization type to leverage, depending on which one will deliver the best result for your needs. The three fundamental types are:
- In naive quantization, all operators are quantized to INT8 and calibrated using the same methodology.
- In hybrid quantization, some operators are quantized to a lower precision, like INT8, while others are left in their original data type representations, like FP16 and FP32 (a rough sketch of this idea follows the list).
- In selective quantization, multiple approaches work in tandem: some operators are quantized to INT8 with varying calibration methods and granularity levels, residuals are quantized to INT8, and sensitive layers remain at FP16.
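As a rough sketch of the hybrid idea in PyTorch's eager-mode API: give the whole model an INT8 qconfig, then opt an assumed "sensitive" layer out of it so it stays in floating point. The model, the choice of sensitive layer, and the random calibration data here are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# A hypothetical model where the final layer is treated as "sensitive" and kept
# in float, while the backbone is statically quantized to INT8.
class MixedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
        self.head = nn.Linear(64, 10)  # assumed "sensitive" layer, left in float
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)       # enter the quantized domain
        x = self.backbone(x)    # runs in INT8 after conversion
        x = self.dequant(x)     # back to float before the sensitive layer
        return self.head(x)

model = MixedNet()
model.eval()

# INT8 qconfig for the whole model, then opt the sensitive layer out of it.
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
model.head.qconfig = None

torch.ao.quantization.prepare(model, inplace=True)
model(torch.randn(16, 32))  # calibration pass with representative (here: random) data
torch.ao.quantization.convert(model, inplace=True)
```

Selective quantization follows the same pattern, just with different qconfigs (calibration methods, per-tensor vs. per-channel granularity) assigned to different submodules instead of a single model-wide setting.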
The choice of method depends on your unique requirements. The exciting part is that quantization, combined with other advances, will help us bring some really beautiful Edge AI applications to life.

