Model Quantization#
Model quantization is the process of mapping high-precision weights and activations to a lower-precision format, such as INT8 or BF16, while maintaining model accuracy. This technique improves the computational and memory efficiency of the model for deployment on NPU devices. It can be applied post-training, allowing existing models to be optimized without retraining.
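As an illustration of this mapping (a generic sketch, not tied to Ryzen AI or any particular tool), the following NumPy snippet quantizes an FP32 tensor to INT8 with a symmetric per-tensor scale and then dequantizes it back:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: q = round(x / scale)."""
    scale = np.abs(x).max() / 127.0          # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original tensor."""
    return q.astype(np.float32) * scale

weights = np.random.randn(64, 128).astype(np.float32)
q, scale = quantize_int8(weights)
reconstructed = dequantize_int8(q, scale)
print("max quantization error:", np.abs(weights - reconstructed).max())
```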
The Ryzen AI compiler supports input models quantized to either INT8 or BF16 format:
CNN models: INT8 or BF16
Transformer models: BF16
Quantization introduces several challenges, chiefly the potential drop in model accuracy. Choosing the right quantization parameters, such as data type, bit width, scaling factors, and per-channel versus per-tensor quantization, adds complexity to the design process.
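For instance, per-channel quantization assigns one scale to each output channel instead of a single scale for the whole tensor, which typically reduces error when channel magnitudes differ widely. A minimal NumPy comparison (illustrative only; the shapes and values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Weight tensor whose output channels (rows) have very different magnitudes.
w = rng.standard_normal((4, 256)).astype(np.float32)
w *= np.array([[0.01], [0.1], [1.0], [10.0]], dtype=np.float32)

def quant_dequant(x, scale):
    q = np.clip(np.round(x / scale), -128, 127)
    return q * scale

# Per-tensor: one scale shared by all channels.
per_tensor_scale = np.abs(w).max() / 127.0
err_tensor = np.abs(w - quant_dequant(w, per_tensor_scale)).mean()

# Per-channel: one scale per output channel (row).
per_channel_scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
err_channel = np.abs(w - quant_dequant(w, per_channel_scale)).mean()

print(f"mean abs error per-tensor:  {err_tensor:.6f}")
print(f"mean abs error per-channel: {err_channel:.6f}")
```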
AMD Quark#
AMD Quark is a comprehensive cross-platform deep learning toolkit designed to simplify and enhance the quantization of deep learning models. Supporting both PyTorch and ONNX models, Quark empowers developers to optimize their models for deployment on a wide range of hardware backends, achieving significant performance gains without compromising accuracy.
For more challenging quantization needs, AMD Quark supports advanced techniques such as Fast Finetuning, which helps recover accuracy lost during quantization.
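As a rough sketch of what an INT8 post-training quantization flow with Quark for ONNX can look like (the configuration name "XINT8", the model paths, and the random calibration reader below are assumptions for illustration; consult the documentation linked below for the exact API of your Quark version):

```python
import numpy as np
from onnxruntime.quantization import CalibrationDataReader

# Hypothetical calibration reader feeding random data; replace with real
# preprocessed samples from your dataset.
class RandomDataReader(CalibrationDataReader):
    def __init__(self, input_name="input", shape=(1, 3, 224, 224), num_samples=8):
        self._data = iter(
            [{input_name: np.random.rand(*shape).astype(np.float32)}
             for _ in range(num_samples)]
        )

    def get_next(self):
        return next(self._data, None)

# Quark ONNX flow (names follow the Quark quickstart; verify against the docs).
from quark.onnx import ModelQuantizer
from quark.onnx.quantization.config import Config, get_default_config

quant_config = get_default_config("XINT8")           # assumed INT8 config name
config = Config(global_quant_config=quant_config)
quantizer = ModelQuantizer(config)
quantizer.quantize_model("model.onnx", "model_quantized.onnx", RandomDataReader())
```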
Documentation#
The complete documentation for AMD Quark for Ryzen AI can be found here: https://quark.docs.amd.com/latest/supported_accelerators/ryzenai/index.html
INT8 Examples#
AMD Quark Tutorial for Ryzen AI Deployment
Running INT8 model on NPU using Getting Started Tutorial
Advanced quantization techniques Fast Finetuning and Cross Layer Equalization for INT8 model
BF16 Examples#
Image Classification using ResNet50 to run BF16 model on NPU
Advanced quantization techniques Fast Finetuning for BF16 models