Release Notes#
Version 0.9#
Quantizer#
Pytorch Quantizer
Dict input/output support for the model forward function
Keyword argument support for the model forward function (see the sketch after this list)
Matmul subroutine quantization support
Support for additional operations in the quantizer: softmax, div, exp, clamp
Support for quantization of some non-standard conv2d variants
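For illustration, a minimal sketch of the kind of forward signature the dict/keyword items above refer to, written in plain PyTorch; the model, input names, and shapes here are hypothetical and not part of the release:

```python
# Hypothetical model (illustrative only): its forward takes a dict and a
# keyword-only argument and returns a dict, the signature style the PyTorch
# quantizer can now handle.
import torch
import torch.nn as nn

class DictModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)

    def forward(self, inputs: dict, *, scale: float = 1.0):
        x = self.conv(inputs["image"]) * scale
        return {"features": x}

model = DictModel().eval()
out = model({"image": torch.randn(1, 3, 32, 32)}, scale=0.5)
print(out["features"].shape)  # torch.Size([1, 8, 32, 32])
```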
ONNX Quantizer
Add support for Float16 and BFloat16 quantization.
Add C++ kernels for customized QuantizeLinear and DequantizeLinear operations.
Support saving quantizer version info to the quantized models’ producer field.
Support conversion of ReduceMean to AvgPool in the IPU workflow.
Support conversion of BatchNorm to Conv in the IPU workflow (see the sketch after this list).
Support optimization of large kernel GlobalAvgPool and AvgPool operations in IPU workflow.
Supports hardware-constraint checks and adjustments for Gemm, Add, and Mul operations in the IPU workflow.
Supports quantization for LayerNormalization, HardSigmoid, Erf, Div, and Tanh for IPU.
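For reference, a small numeric sketch of the equivalence behind the BatchNorm-to-Conv conversion; the quantizer's actual graph rewrite may differ in detail:

```python
# Equivalence sketch: a standalone BatchNorm2d in eval mode is an affine
# per-channel transform, so it can be rewritten as a 1x1 depthwise Conv2d.
import torch
import torch.nn as nn

C = 4
bn = nn.BatchNorm2d(C).eval()
bn.running_mean.uniform_(-1.0, 1.0)
bn.running_var.uniform_(0.5, 2.0)
bn.weight.data.uniform_(0.5, 1.5)
bn.bias.data.uniform_(-0.5, 0.5)

conv = nn.Conv2d(C, C, kernel_size=1, groups=C, bias=True)
with torch.no_grad():
    # y = gamma * (x - mean) / sqrt(var + eps) + beta  ==  w * x + b
    w = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    conv.weight.copy_(w.reshape(C, 1, 1, 1))
    conv.bias.copy_(bn.bias - bn.running_mean * w)

x = torch.randn(1, C, 8, 8)
print(torch.allclose(bn(x), conv(x), atol=1e-5))  # True
```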
ONNXRuntime Execution Provider#
Support new OPs, such as Conv1d, LayerNorm, Clip, Abs, Unsqueeze, ConvTranspose.
Support for padding and depadding based on the IPU subgraph's inputs and outputs.
Support for U8S8 models quantized by the ONNX quantizer.
Improved report summary tools.
IPU and Compiler#
Supported exp/tanh/channel-shuffle/pixel-unshuffle/space2depth
Performance uplift for xint8-output softmax
Improved partition messages for CPU/DPU
Improved validation checks for some operators
Accelerated compilation of large models
Fixed the elew/pool/dwc/reshape mismatch issue and the stride_slice hang issue
Fixed the str_w != str_h issue in Conv
LLM#
SmoothQuant for OPT 1.3B, 2.7B, 6.7B, and 13B models (see the sketch after this list)
Hugging Face Optimum ORT Quantizer for ONNX and the Pytorch dynamic quantizer for Pytorch
Enabled Flash Attention v2 for larger prompts as a custom torch.nn.Module
Enabled all CPU ops in bfloat16 or float32 with Pytorch
int32 accumulator in AIE (previously int16)
DynamicQuantLinear op support in ONNX
Support different compute primitives for prefill/prompt and token phases
Zero copy of weights shared between different op primitives
Model saving after quantization and loading at runtime for both Pytorch and ONNX
Enabled profiling of prefill/prompt and token time using a local copy of the OPT model with additional timer instrumentation
Added demo mode script with greedy, stochastic and contrastive search options
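For background, a minimal NumPy sketch of the per-channel scaling idea behind SmoothQuant; the released flow for the OPT models performs this inside the quantizer and chooses its own alpha and calibration data, so this is illustrative only:

```python
# SmoothQuant idea: migrate quantization difficulty from activations to
# weights using a per-channel scale s_j = max|X_j|^alpha / max|W_j|^(1-alpha),
# then compute Y = (X / s) @ (diag(s) W), which is mathematically equivalent.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64)) * np.linspace(0.1, 8.0, 64)  # outlier channels
W = rng.normal(size=(64, 64))
alpha = 0.5

act_max = np.abs(X).max(axis=0)          # per input-channel activation range
wgt_max = np.abs(W).max(axis=1)          # per input-channel weight range
s = act_max**alpha / wgt_max**(1 - alpha)

X_smooth = X / s                          # activations become easier to quantize
W_smooth = W * s[:, None]                 # scale folded into the weights offline

print(np.allclose(X @ W, X_smooth @ W_smooth))        # same result
print(np.abs(X).max(), "->", np.abs(X_smooth).max())  # reduced activation outliers
```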
ASR#
Support for Whisper-tiny
All GEMMs offloaded to AIE
Improved compile time
Improved WER
Known issues#
Flow control OPs including “Loop”, “If”, “Reduce” not supported by VOE
Resize OP in ONNX opset 10 or lower is not supported by VOE
Tensorflow 2.x quantizer supports models within tf.keras.model only
Running quantizer docker in WSL on Ryzen AI laptops may encounter OOM (Out-of-memory) issue
Running multiple concurrent models using temporal sharing on the 5x4 binary is not supported
Only batch sizes of 1 are supported
Only models with pretrained weights (pretrained = TRUE) should be imported
Launching multiple processes on four 1x4 binaries can cause hangs, especially when models have many subgraphs
Version 0.8#
Quantizer#
Pytorch Quantizer
Pytorch 1.13 and 2.0 support
Mixed-precision quantization support, covering float32/float16/bfloat16/intx quantization
Support for bit-wise accuracy cross-checks between the quantizer and ONNX Runtime
Split and chunk operators were automatically converted to slicing
Add support for BFP data type quantization
Support of some operations in quantizer: where, less, less_equal, greater, greater_equal, not, and, or, eq, maximum, minimum, sqrt, Elu, Reduction_min, argmin
QAT supports training on multiple GPUs
QAT supports operations with multiple inputs or outputs
ONNX Quantizer
Provided Python wheel file for installation
Support for ONNX Runtime 1.15
Supports setting the input shapes of the random data reader
Supports the random data reader in the dump model function
Supports saving the S8S8 model in U8S8 format for IPU (see the sketch after this list)
Supports simulation of Sigmoid, Swish, Softmax, AvgPool, GlobalAvgPool, ReduceMean and LeakyRelu for IPU
Supports node fusions for IPU
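As background for the S8S8-to-U8S8 item above, a small NumPy sketch showing that shifting signed int8 activation values and their zero point by +128 preserves the dequantized values while storing them as uint8 (weights stay int8); the quantizer's actual conversion may differ in detail:

```python
# S8S8 -> U8S8 conversion sketch: same scale, data and zero point shifted
# by +128, so the dequantized real values are unchanged.
import numpy as np

scale, zp_s8 = 0.02, -10
q_s8 = np.array([-128, -10, 0, 25, 127], dtype=np.int8)

q_u8 = (q_s8.astype(np.int16) + 128).astype(np.uint8)  # shift data by +128
zp_u8 = zp_s8 + 128                                     # shift zero point by +128

dq_s8 = scale * (q_s8.astype(np.float32) - zp_s8)
dq_u8 = scale * (q_u8.astype(np.float32) - zp_u8)
print(np.allclose(dq_s8, dq_u8))  # True: same real values, unsigned storage
```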
ONNXRuntime Execution Provider#
Support for U8S8 quantized ONNX models
Improved fallback to the CPU EP
Improved the AIE plugin framework
Supports LLM Demo
Supports Gemm ASR
Supports E2E AIE acceleration for Pre/Post ops
Improved ease of use for partitioning and deployment
Supports models containing subgraphs
Supports report summary about OP assignment
Supports report summary about DPU subgraphs falling back to CPU
Improved log printing and troubleshooting tools
Upstreamed data type support and bug fixes to the ONNX Runtime GitHub repo
IPU and Compiler#
Extended the support range of some operators
Larger input size: conv2d, dwc
Padding mode: pad
Broadcast: add
Variant dimension (non-NHWC shape): reshape, transpose, add
Support for new operators, e.g. reducemax/min/sum/avg and argmax/argmin
Enhanced multi-level fusion
Performance enhancement for some operators
Add quantization information validation
Improvement in device partition
User-friendly messages
Target-dependency check
Demos#
New Demos link: https://account.amd.com/en/forms/downloads/ryzen-ai-software-platform-xef.html?filename=transformers_2308.zip
LLM demo with OPT-1.3B/2.7B/6.7B
Automatic speech recognition demo with Whisper-tiny
Known issues#
Flow control OPs including “Loop”, “If”, “Reduce” not supported by VOE
Resize OP in ONNX opset 10 or lower not supported by VOE
Tensorflow 2.x quantizer supports models within tf.keras.model only
Running quantizer docker in WSL on Ryzen AI laptops may encounter OOM (Out-of-memory) issue
Running multiple concurrent models by temporal sharing on the performance-optimized overlay (5x4.xclbin) is not supported
Only batch size 1 is supported for IPU
Version 0.7#
Quantizer#
Docker Containers
Provided CPU dockers for Pytorch, Tensorflow 1.x, and Tensorflow 2.x quantizer
Provided GPU Docker files to build GPU dockers
Pytorch Quantizer
Supports conversion of multiple-output operators to slicing
Enhanced transpose OP optimization
Inspector supports new IP targets for IPU
ONNX Quantizer
Provided Python wheel file for installation
Supports quantizing ONNX models for IPU as a plugin for the ONNX Runtime native quantizer
Supports power-of-two quantization with both QDQ and QOP formats
Supports Non-overflow and Min-MSE quantization methods (see the sketch after this list)
Supports various quantization configurations for power-of-two quantization in both QDQ and QOP formats
Supports signed and unsigned configurations
Supports symmetric and asymmetric configurations
Supports per-tensor and per-channel configurations
Supports bias quantization using the int8 datatype for IPU
Supports quantization parameter (scale) refinement for IPU
Supports excluding certain operations from quantization for IPU
Supports ONNX models larger than 2 GB
Supports using CUDAExecutionProvider for calibration in quantization
Open source and upstreamed to Microsoft Olive Github repo
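For background on the Non-overflow and Min-MSE methods listed above, a minimal NumPy sketch of power-of-two scale selection; the quantizer implements its own search and calibration, so this is illustrative only:

```python
# Power-of-two scale selection sketch: "Non-overflow" picks the smallest
# power-of-two scale that covers the data range, while "Min-MSE" searches
# nearby exponents for the lowest quantization error.
import numpy as np

def quantize(x, scale, qmin=-128, qmax=127):
    return np.clip(np.round(x / scale), qmin, qmax) * scale

def pow2_scales(x, qmax=127):
    # Non-overflow: choose k so that max|x| / 2^k fits into [qmin, qmax]
    k = int(np.ceil(np.log2(np.abs(x).max() / qmax)))
    nonoverflow = 2.0**k
    # Min-MSE: try a few neighbouring exponents and keep the best one
    candidates = [2.0**(k + d) for d in range(-3, 2)]
    minmse = min(candidates, key=lambda s: np.mean((x - quantize(x, s)) ** 2))
    return nonoverflow, minmse

x = np.random.default_rng(0).normal(scale=0.3, size=10000).astype(np.float32)
print(pow2_scales(x))  # exact scales depend on the data
```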
TensorFlow 2.x Quantizer
Added support for exporting the quantized model to ONNX format
Added support for keras.layers.Activation('leaky_relu')
TensorFlow 1.x Quantizer
Added support for folding Reshape and ResizeNearestNeighbor operators.
Added support for splitting AvgPool and MaxPool with large kernel sizes into smaller kernel sizes (see the sketch after this list).
Added support for quantizing Sum, StridedSlice, and Maximum operators.
Added support for setting the input shape of the model, which is useful in deploying models with undefined input shapes.
Added support for setting the opset version when exporting to ONNX format
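As a numeric illustration of the pooling split (shown here with PyTorch rather than TensorFlow), a non-overlapping average pool with a large kernel equals a cascade of smaller average pools whose kernel sizes multiply to the original:

```python
# 8x8 average pooling is equivalent to a 4x4 average pool followed by a 2x2
# average pool when the windows do not overlap (stride == kernel size).
import torch
import torch.nn.functional as F

x = torch.randn(1, 16, 8, 8)
big = F.avg_pool2d(x, kernel_size=8)                     # one 8x8 pool
small = F.avg_pool2d(F.avg_pool2d(x, kernel_size=4), 2)  # 4x4 then 2x2
print(torch.allclose(big, small, atol=1e-6))             # True
```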
ONNX Runtime Execution Provider#
Vitis ONNX Runtime Execution Provider (VOE)
Supports ONNX Opset version 18, ONNX Runtime 1.16.0, and ONNX version 1.13
Supports both C++ and Python APIs (Python version 3); a Python example follows this list
Supports deploying models with other EPs
Supports falling back to CPU EP
Open source and upstreamed to ONNX Runtime Github repo
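A minimal Python example of creating a session with VOE and CPU fallback; the provider-option key "config_file" and the file names below are assumptions based on a typical Ryzen AI setup and may differ in your installation:

```python
# Sketch of running a quantized model through the Vitis AI EP from Python,
# falling back to the CPU EP for unsupported nodes. Model path, input shape,
# and the vaip_config.json name are placeholders.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model_quantized.onnx",                               # hypothetical model
    providers=["VitisAIExecutionProvider", "CPUExecutionProvider"],
    provider_options=[{"config_file": "vaip_config.json"}, {}],
)

inp = session.get_inputs()[0]
x = np.random.rand(1, 3, 224, 224).astype(np.float32)    # example input
outputs = session.run(None, {inp.name: x})
print([o.shape for o in outputs])
```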
Compiler
Multi-level op fusion
Supports multi-output operators such as chunk and split
Supports splitting large pooling operations into smaller ones
Supports 2-channel writeback feature for Hard-Sigmoid and Depthwise-Convolution
Supports 1-channel GStiling
Explicit pad-fix in CPU subgraph for 4-byte alignment
Tuned performance for multiple models
IPU#
Two configurations
Power Optimized Overlay (1x4.xclbin)
Suitable for smaller AI models
Supports spatial sharing, up to 4 concurrent AI workloads
Performance Optimized Overlay (5x4.xclbin)
Suitable for larger AI models
Known issues#
Flow control OPs including “Loop”, “If”, “Reduce” are not supported by VOE
Resize OP in ONNX opset 10 or lower not supported by VOE
Tensorflow 2.x quantizer supports models within tf.keras.model only
Running quantizer docker in WSL on Ryzen AI laptops may encounter OOM (Out-of-memory) issue
Running multiple concurrent models by temporal sharing on the performance-optimized overlay (5x4.xclbin) is not supported