Model Deployment#

Windows Machine Learning (WinML) enables C#, C++, and Python developers to run ONNX AI models locally on Windows PCs through ONNX Runtime, with automatic execution provider management across hardware targets including CPUs, GPUs, and NPUs. You can use models from PyTorch, TensorFlow/Keras, TensorFlow Lite (TFLite), scikit-learn, and other frameworks by converting them to ONNX for ONNX Runtime.

In short, Windows ML provides a shared, Windows-wide ONNX Runtime along with support for dynamically downloading execution providers (EPs).

For more details, see the Windows ML official documentation.

Running CNN/Transformer models on NPU#

Windows ML provides a streamlined workflow for deploying CNN and Transformer models on Ryzen AI PCs. You can either deploy the original float model, with automatic BF16 conversion, or quantize the model with the AI Toolkit (QDQ format).

Windows ML workflow#

[Figure: Windows ML workflow (winml-workflow.png)]

Step 1: Download the Original Float Model

Start with your pre-trained ONNX model in FP32 format. Models can be exported from PyTorch, TensorFlow, or obtained from model repositories.
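As an illustration of this step, the following sketch exports a pre-trained PyTorch model to FP32 ONNX. The model choice, file name, and opset version are assumptions for the example, not requirements from this workflow.

```python
def export_resnet_to_onnx(output_path: str = "resnet50_fp32.onnx") -> str:
    """Sketch: export a pre-trained ResNet-50 to an FP32 ONNX file.
    Model, file name, and opset are illustrative choices."""
    import torch
    import torchvision

    model = torchvision.models.resnet50(weights="IMAGENET1K_V1").eval()
    dummy = torch.randn(1, 3, 224, 224)  # NCHW input expected by ResNet
    torch.onnx.export(
        model,
        dummy,
        output_path,
        input_names=["input"],
        output_names=["output"],
        opset_version=17,
    )
    return output_path
```

Any FP32 ONNX model works here; the export call is only needed when starting from a framework checkpoint rather than an existing ONNX file.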

Step 2: Model Quantization using VS AI Toolkit (Optional)

For improved inference performance, quantize your model using the VS AI Toolkit or an Olive recipe:

  • A8W8 quantization: Recommended for CNN models (ResNet, MobileNet, etc.)

  • A16W8 quantization: Recommended for Transformer models (BERT, CLIP, etc.)
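The recommendations above can be sketched with ONNX Runtime's own quantization tooling, which produces QDQ-format models. This is a minimal sketch, not the AI Toolkit's exact recipe: the function and file names are illustrative, and the calibration reader must be supplied by the caller.

```python
# Recommended activation/weight bit-widths from the list above, as plain data.
# The dictionary name is illustrative, not an official API.
RECOMMENDED_SCHEME = {
    "cnn": "A8W8",           # e.g. ResNet, MobileNet
    "transformer": "A16W8",  # e.g. BERT, CLIP
}

def quantize_a8w8(model_fp32: str, model_qdq: str, calibration_reader) -> None:
    """Sketch of A8W8 static quantization (QDQ format) with ONNX Runtime.
    `calibration_reader` is a CalibrationDataReader over representative inputs."""
    from onnxruntime.quantization import QuantFormat, QuantType, quantize_static

    quantize_static(
        model_fp32,
        model_qdq,
        calibration_reader,
        quant_format=QuantFormat.QDQ,      # QDQ format, as in the workflow above
        activation_type=QuantType.QUInt8,  # A8: 8-bit activations
        weight_type=QuantType.QInt8,       # W8: 8-bit weights
    )
```

For Transformer models, the A16W8 variant would use 16-bit activation types instead; consult the AI Toolkit or Olive documentation for the exact supported configuration.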

Step 3: Automatic Execution Provider Registration

Windows ML automatically downloads and registers the appropriate execution providers based on available hardware:

| Execution Provider | Hardware Target |
| --- | --- |
| VitisAIExecutionProvider | AMD Ryzen AI NPU |
| MIGraphXExecutionProvider | AMD GPU (ROCm) |
| DmlExecutionProvider | DirectML (GPU/NPU) |
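Although Windows ML downloads and registers these EPs automatically, application code can still express the same preference order explicitly when creating a session. The following sketch shows a fallback chain mirroring the table above; `create_session` and the preference list are illustrative, not part of the Windows ML API.

```python
# Preference order mirroring the table above: try the NPU first, then the
# GPU paths, then fall back to CPU. The pure helper is testable without hardware.
EP_PREFERENCE = [
    "VitisAIExecutionProvider",   # AMD Ryzen AI NPU
    "MIGraphXExecutionProvider",  # AMD GPU (ROCm)
    "DmlExecutionProvider",       # DirectML (GPU/NPU)
    "CPUExecutionProvider",       # always-available fallback
]

def pick_provider(available: list) -> str:
    """Return the most preferred execution provider that is available."""
    for ep in EP_PREFERENCE:
        if ep in available:
            return ep
    raise RuntimeError("no known execution provider available")

def create_session(model_path: str):
    """Hypothetical sketch: open an ONNX Runtime session on the best EP."""
    import onnxruntime as ort

    ep = pick_provider(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=[ep])
```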

Step 4: Execution Policy for device selection

Select the preferred execution target using the execution policy:

| Execution Policy | First Preference EP |
| --- | --- |
| PREFER_CPU | CPUExecutionProvider |
| PREFER_GPU | DmlExecutionProvider |
| PREFER_NPU | VitisAIExecutionProvider |

The EP selection can be configured either to target a specific execution provider or through a general execution policy. For more details, refer to the Windows ML documentation on Execution Providers.
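The policy-to-EP mapping from the table above can be encoded as plain data. The session-creation sketch below assumes a recent ONNX Runtime build that exposes an EP device-selection policy on `SessionOptions`; the exact method name may differ in your ORT version, so treat it as an assumption and check the ONNX Runtime API reference.

```python
# Mapping from execution policy to first-preference EP, as in the table above.
POLICY_FIRST_PREFERENCE = {
    "PREFER_CPU": "CPUExecutionProvider",
    "PREFER_GPU": "DmlExecutionProvider",
    "PREFER_NPU": "VitisAIExecutionProvider",
}

def session_with_npu_policy(model_path: str):
    """Hypothetical sketch: ask ONNX Runtime to prefer the NPU via policy.
    Assumes an ORT build exposing the EP selection-policy API (not verified)."""
    import onnxruntime as ort

    opts = ort.SessionOptions()
    # Assumed API: device-selection policy; name may differ per ORT release.
    opts.set_provider_selection_policy(
        ort.OrtExecutionProviderDevicePolicy.PREFER_NPU
    )
    return ort.InferenceSession(model_path, sess_options=opts)
```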

Step 5: Model Compilation

The model is compiled for the target hardware:

  • Float models: VAIML performs automatic BF16 conversion for NPU execution

  • Quantized models: A8W8/A16W8 models are compiled using the X2/X1 compiler

For more details, refer to the model compilation and deployment documentation.

Step 6: Model Inference

Use the ONNX Runtime APIs to run inference on the compiled model. The model will execute on the selected hardware target based on the execution policy and available EPs.
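A minimal inference sketch with the ONNX Runtime Python API follows. The input/output names are illustrative (inspect the session for the real ones), and the `top_k` postprocessing helper is an assumption added for a classification-style example, not part of the workflow above.

```python
def top_k(scores, k=5):
    """Return indices of the k largest scores (pure-Python postprocess helper)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def run_inference(model_path: str, input_array):
    """Hypothetical sketch: run a compiled model with ONNX Runtime.
    The EP used is determined by the execution policy and available EPs."""
    import onnxruntime as ort

    session = ort.InferenceSession(model_path)
    input_name = session.get_inputs()[0].name  # discover the real input name
    (output,) = session.run(None, {input_name: input_array})
    return output
```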

Running LLM models on NPU#

Windows ML supports Foundry Local models for on-device AI inference, providing both privacy and performance. Foundry Local is currently available in preview. It automatically detects the NPU and downloads a model compatible with the NPU device.

LLM prerequisites#

Make sure the following requirements are met before proceeding:

| Requirement | Details |
| --- | --- |
| Operating System | Windows 10, Windows 11 |
| Hardware (Minimum) | 8 GB RAM, 3 GB free disk space |
| Hardware (Recommended) | 16 GB RAM, 15 GB free disk space |
| Acceleration | AMD NPU |

Running LLM on AMD NPU#

LLM models can be run on AMD NPU using Foundry Local or Windows ML APIs. Foundry Local provides an easy-to-use interface for running LLM models on AMD NPU, while Windows ML APIs allow for more customization and control over the inference process.

Option 1: Running LLM using Foundry Local

This is the recommended option for most users as it provides a simple and efficient way to run LLM models on AMD NPU without needing to manage dependencies or optimize the model manually.

Option 2: Running a Custom LLM Model using Windows ML and OGA APIs

This option allows users to run custom LLM models on AMD NPU using Windows ML APIs. This option is suitable for users who want more control over the inference process and are comfortable managing dependencies and model optimization manually.
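For Option 1, Foundry Local serves loaded models through an OpenAI-compatible local endpoint, which any HTTP client can query. The sketch below is illustrative: the port and model id are assumptions (Foundry Local assigns them at runtime, so check the `foundry` CLI output for the real values), and only the payload builder is exercised without a running service.

```python
import json
import urllib.request

def build_chat_request(model_id: str, prompt: str) -> dict:
    """Build an OpenAI-style chat-completion payload (pure helper)."""
    return {
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
    }

def ask_foundry_local(prompt: str,
                      model_id: str = "example-npu-model",  # hypothetical id
                      endpoint: str = "http://localhost:5273/v1/chat/completions"):
    """Hypothetical sketch: query a Foundry Local OpenAI-compatible endpoint.
    Port and model id are assumptions; read them from the Foundry Local CLI."""
    payload = json.dumps(build_chat_request(model_id, prompt)).encode()
    req = urllib.request.Request(
        endpoint, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```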

For detailed instructions on each option, see the Running LLM Models on NPU documentation.