ONNX Quantization

Installation

If you have prepared your working environment using the bundled installation script, the Vitis AI ONNX Quantizer is already installed.

Otherwise, ensure that the Vitis AI ONNX Quantizer is correctly installed by following the ONNX Quantizer installation instructions.
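
To quickly confirm the installation from Python, an import check is enough (a minimal sketch, assuming only the vai_q_onnx package name used throughout this page):

# Raises ImportError if the Vitis AI ONNX Quantizer is not installed.
import vai_q_onnx
print("vai_q_onnx imported successfully")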

Overview

ONNX quantization supports post-training quantization (PTQ). This static quantization method first runs the model on a set of inputs called calibration data. During these runs, the flow computes the quantization parameters for each activation. These quantization parameters are written as constants to the quantized model and used for all inputs. The quantization tool supports the following calibration methods: MinMax, Entropy, Percentile, and MinMSE.

Running vai_q_onnx

Quantization in ONNX Runtime refers to the linear quantization of an ONNX model. We developed the vai_q_onnx tool as a plugin for ONNX Runtime to support additional post-training quantization (PTQ) functions for quantizing deep learning models. PTQ converts a pre-trained float model into a quantized model with little degradation in model accuracy. It requires a representative dataset to run a few batches of inference on the float model and obtain the distributions of the activations; this process is also called quantization calibration.

Note

ONNX models must use opset 10 or higher to be quantized by the Vitis AI ONNX Quantizer. Models with opset < 10 must be reconverted to ONNX from their original framework using opset 10 or above. Alternatively, you can use the ONNX Version Converter (see the onnx/onnx repository) to upgrade the model's opset.
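
As a sketch of the converter route, the onnx package can upgrade a model's opset directly (the file names here are placeholders):

import onnx
from onnx import version_converter

# Load the old model and convert it to opset 11.
model = onnx.load("model_opset9.onnx")
converted = version_converter.convert_version(model, 11)
onnx.save(converted, "model_opset11.onnx")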

Use the following steps to run PTQ with vai_q_onnx.

1. Preparing the Float Model and Calibration Set

Before running vai_q_onnx, prepare the float model and the calibration set, including the following files:

  • Float model: a floating-point ONNX model in .onnx format.

  • Calibration dataset: a subset of the training or validation dataset that represents the input data distribution; usually 100 to 1000 images are enough.
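
The calibration set is fed to the quantizer through a data reader. Below is a minimal sketch built on the CalibrationDataReader base class from onnxruntime.quantization; the model path and the list of preprocessed batches are hypothetical placeholders:

import numpy as np
import onnxruntime
from onnxruntime.quantization import CalibrationDataReader

class ImageCalibrationReader(CalibrationDataReader):
    """Yields one calibration batch per call; None signals the end."""

    def __init__(self, model_path, calibration_batches):
        # Query the model for its input name so the feed dict matches it.
        session = onnxruntime.InferenceSession(
            model_path, providers=["CPUExecutionProvider"])
        self.input_name = session.get_inputs()[0].name
        self.batches = iter(calibration_batches)

    def get_next(self):
        batch = next(self.batches, None)
        if batch is None:
            return None  # calibration data exhausted
        return {self.input_name: np.asarray(batch, dtype=np.float32)}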

2. Quantizing Using the vai_q_onnx API

The static quantization method first runs the model on a set of inputs called calibration data. During these runs, the flow computes the quantization parameters for each activation. These quantization parameters are written as constants to the quantized model and used for all inputs. The vai_q_onnx quantization tool extends the standard calibration methods with both float scale and power-of-two scale quantization. Float scale methods include MinMax, Entropy, and Percentile; power-of-two scale methods include MinMax and MinMSE.

import vai_q_onnx

vai_q_onnx.quantize_static(
    model_input,                # path of the float model to quantize
    model_output,               # path where the quantized model is saved
    calibration_data_reader,    # enumerates calibration inputs (or None)
    quant_format=vai_q_onnx.VitisQuantFormat.FixNeuron,
    calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE,
    input_nodes=[],
    output_nodes=[],
    extra_options=None,
)

Arguments

model_input: (String) This parameter specifies the file path of the model that is to be quantized.

model_output: (String) This parameter specifies the file path where the quantized model will be saved.

calibration_data_reader: (Object or None) This parameter is a calibration data reader that enumerates the calibration data and generates inputs for the original model. If you wish to use random data for a quick test, you can set calibration_data_reader to None.

quant_format: (Enum) This parameter defines the quantization format for the model. It has the following options:

  • QOperator: This option quantizes the model directly using quantized operators.

  • QDQ: This option quantizes the model by inserting QuantizeLinear/DeQuantizeLinear into the tensor. It supports 8-bit quantization only.

  • VitisQuantFormat.QDQ: This option quantizes the model by inserting VAIQuantizeLinear/VAIDeQuantizeLinear into the tensor. It supports a wider range of bit-widths and configurations.

  • VitisQuantFormat.FixNeuron: This option quantizes the model by inserting FixNeuron (a combination of QuantizeLinear and DeQuantizeLinear) into the tensor. This is the default value.
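
For example, to produce a standard 8-bit QDQ model instead of the default FixNeuron format, the call could look like the following sketch (QuantFormat comes from onnxruntime.quantization; the paths and the data reader are placeholders):

import vai_q_onnx
from onnxruntime.quantization import QuantFormat

vai_q_onnx.quantize_static(
    "model_float.onnx",
    "model_qdq.onnx",
    calibration_data_reader,
    quant_format=QuantFormat.QDQ,  # insert 8-bit QuantizeLinear/DeQuantizeLinear pairs
)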

calibrate_method: (Enum) This parameter is used to set the power-of-2 scale quantization method. It currently supports two methods:

  • vai_q_onnx.PowerOfTwoMethod.NonOverflow

  • vai_q_onnx.PowerOfTwoMethod.MinMSE (default)

input_nodes: (List of Strings) This parameter is a list of the names of the starting nodes to be quantized. Nodes before these nodes in the model are not quantized. For example, this argument can be used to skip some pre-processing nodes or to keep the first node unquantized. The default value is an empty list ([]).

output_nodes: (List of Strings) This parameter is a list of the names of the end nodes to be quantized. Nodes after these nodes in the model are not quantized. For example, this argument can be used to skip some post-processing nodes or to keep the last node unquantized. The default value is an empty list ([]).

extra_options: (Dict or None) This parameter is a dictionary of additional options that can be passed to the quantization process. If there are no additional options to provide, this can be set to None. The default value is None.
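
Putting the pieces together, an end-to-end PTQ run might look like the sketch below. The file paths are placeholders, calibration_batches is a hypothetical list of preprocessed input batches, and ImageCalibrationReader is the data reader sketched in step 1:

import vai_q_onnx

# Data reader from step 1; calibration_batches is assumed to be prepared already.
data_reader = ImageCalibrationReader("model_float.onnx", calibration_batches)

vai_q_onnx.quantize_static(
    "model_float.onnx",        # float model to quantize
    "model_quantized.onnx",    # where the quantized model is written
    data_reader,               # or None to run a quick test with random data
    quant_format=vai_q_onnx.VitisQuantFormat.FixNeuron,
    calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE,
    input_nodes=[],            # e.g. ["first_conv"] to leave pre-processing unquantized
    output_nodes=[],           # e.g. ["final_fc"] to leave post-processing unquantized
    extra_options=None,
)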