Overview#
OGA-based Flow with Hybrid Execution#
Ryzen AI Software is the best way to deploy quantized 4-bit LLMs on Ryzen AI 300-series PCs. This solution uses a hybrid execution mode, which leverages both the NPU and integrated GPU (iGPU), and is built on the OnnxRuntime GenAI (OGA) framework.
Hybrid execution mode optimally partitions the model so that different operations are scheduled on the NPU or the iGPU. This minimizes time-to-first-token (TTFT) in the prefill phase and maximizes token generation throughput (tokens per second, TPS) in the decode phase.
OGA is a multi-vendor generative AI framework from Microsoft that provides a convenient LLM interface for execution backends such as Ryzen AI.
Supported Configurations#
Only Ryzen AI 300-series Strix Point (STX) and Krackan Point (KRK) processors support OGA-based hybrid execution.
Developers with Ryzen AI 7000- and 8000-series processors can get started using the CPU-based examples linked in the Supported LLMs table.
Windows 11 is the required operating system.
Development Interfaces#
Note
Only the OGA APIs interface provides support for DeepSeek-R1-Distill models at this time.
The Ryzen AI LLM software stack is available through three development interfaces, each suited for specific use cases as outlined in the sections below. All three interfaces are built on top of native OnnxRuntime GenAI (OGA) libraries, as shown in the Ryzen AI Software Stack diagram below.
The high-level Python APIs and the Server Interface also leverage the lemonade SDK, which is multi-vendor open-source software that provides everything necessary to get started quickly with LLMs on OGA. A key benefit of both OGA and lemonade is that software developed against their interfaces is portable to many other execution backends.
[Ryzen AI Software Stack diagram: Your Python Application | Your LLM Stack | Your Native Application, each built on the native OGA libraries via the interfaces described below. * indicates open-source software (OSS).]
High-Level Python SDK#
The high-level Python SDK, lemonade, can get you started in under 5 minutes with a PyPI installation.
This SDK is the fastest way to:
Experiment with models in hybrid execution mode on Ryzen AI hardware.
Validate inference speed and task performance.
Integrate with Python apps using a high-level API.
To get started in Python, follow these instructions: High-Level Python SDK.
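For illustration, a minimal script using the high-level API might look like the sketch below. The checkpoint name and recipe string are assumptions based on the lemonade documentation; see the linked instructions for the exact values supported by your installation.

```python
# Minimal sketch of the high-level Python API (assumes lemonade is
# installed from PyPI with the appropriate OGA extras for your device).
from lemonade.api import from_pretrained

# "oga-hybrid" is assumed to be the recipe targeting NPU + iGPU hybrid
# execution; the checkpoint name below is illustrative (see the AMD
# hybrid collection on Hugging Face for pre-optimized models).
model, tokenizer = from_pretrained(
    "amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid",
    recipe="oga-hybrid",
)

input_ids = tokenizer("What is an NPU?", return_tensors="pt").input_ids
response = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(response[0]))
```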
Server Interface (REST API)#
The Server Interface provides a convenient means to integrate with applications that:
Already support an LLM server interface, such as the Ollama server or the OpenAI API.
Are written in any language (C++, C#, JavaScript, etc.) that supports REST APIs.
Benefit from process isolation for the LLM backend.
To get started with the server interface, follow these instructions: Server Interface (REST API).
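As a sketch of what an integration might look like, the snippet below posts a chat completion request to a locally running server using the OpenAI-style REST schema. The port, endpoint path, and model name are assumptions; use the values reported by your server when it starts.

```python
# Hedged sketch: calling a local LLM server through an OpenAI-compatible
# REST endpoint. The port, path, and model name are assumptions.
import requests

payload = {
    "model": "Llama-3.2-1B-Instruct-Hybrid",  # hypothetical model id
    "messages": [{"role": "user", "content": "What is hybrid execution?"}],
    "max_tokens": 64,
}
resp = requests.post(
    "http://localhost:8000/api/v0/chat/completions",  # assumed address
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```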
OGA APIs for C++ Libraries and Python#
Native C++ libraries for OGA are available, providing full customizability for deployment in native applications.
The Python bindings for OGA also provide a customizable interface for Python development.
To get started with the OGA APIs, follow these instructions: OGA APIs for C++ and Python.
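For orientation, a minimal generation loop with the OGA Python bindings might look like the following. Method names have changed across OGA releases, so treat this as a sketch under those assumptions and defer to the linked instructions for your installed version; the model path is a placeholder.

```python
# Sketch of a token-by-token generation loop using the OGA Python
# bindings (onnxruntime_genai). API details vary by OGA release.
import onnxruntime_genai as og

# Placeholder path: a folder containing the exported model files and
# its genai_config.json.
model = og.Model("path/to/hybrid-llm-model")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("What is an NPU?"))

# Generate one token at a time until the model emits EOS or reaches
# max_length.
while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```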
Supported LLMs#
The following tables list all LLMs that have been validated in Ryzen AI hybrid execution mode. The hybrid examples are built on top of OnnxRuntime GenAI (OGA).
The pre-optimized models for hybrid execution used in these examples are available in the AMD hybrid collection on Hugging Face. It is also possible to run fine-tuned versions of the models listed (for example, fine-tuned versions of Llama2 or Llama3). For instructions on how to prepare a fine-tuned OGA model for hybrid execution, refer to Preparing Models.
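For example, one way to fetch a pre-optimized checkpoint locally is with the huggingface_hub client, as sketched below; the repo id is illustrative, so substitute any model from the AMD hybrid collection.

```python
# Hedged sketch: downloading a pre-optimized hybrid checkpoint from
# Hugging Face. The repo id below is illustrative.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid"
)
print("Model files downloaded to:", local_dir)
```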
Ryzen AI Hybrid (OGA int4, ISL = 1024)

| Model | Instructions | TTFT [s] | TPS [tok/s] | Validation |
|---|---|---|---|---|
|  |  | 0.68 | 60.0 | 🟢 |
|  |  | 2.64 | 20.1 | 🟢 |
|  |  | 2.68 | 19.2 | 🟢 |
CPU Baseline (HF bfloat16) vs. Ryzen AI Hybrid (OGA int4)

| Model | Baseline Example | Baseline Validation | Hybrid Example | TTFT Speedup | Tokens/s Speedup | Hybrid Validation |
|---|---|---|---|---|---|---|
|  |  | 🟢 |  | 2.7x | 5.2x | 🟢 |
|  |  | 🟢 |  | 2.7x | 8.5x | 🟢 |
|  |  | 🟢 |  | 3.9x | 7.7x | 🟢 |
|  |  | 🟢 |  | 2.9x | 7.6x | 🟢 |
|  |  | 🟢 |  | 4.4x | 9.7x | 🟢 |
|  |  | 🟢 |  | 4.0x | 7.9x | 🟢 |
|  |  | 🟢 |  | 4.8x | 8.3x | 🟢 |
|  |  | 🟢 |  | 5.1x | 8.1x | 🟢 |
|  |  | 🟢 |  | 4.4x | 9.3x | 🟢 |
|  |  | 🟢 |  | 4.0x | 9.1x | 🟢 |
The lemonade SDK table was compiled using validation, benchmarking, and accuracy metrics as measured by the ONNX TurnkeyML v6.0.0 lemonade commands in each example link.
Data collection details:

- All validation, performance, and accuracy metrics are collected on the same system configuration:
  - System: HP OmniBook Ultra Laptop 14z
  - Processor: AMD Ryzen AI 9 HX 375 with Radeon 890M
  - Memory: 32GB of RAM
- The Hugging Face `transformers` framework is used as the baseline implementation for speedup and accuracy comparisons. The baseline checkpoint is the original `safetensors` Hugging Face checkpoint linked in each table row, in the `bfloat16` data type.
- All speedup numbers are the measured performance of the model with an input sequence length (ISL) of 1024 and an output sequence length (OSL) of 64, on the specified backend, divided by the measured performance of the baseline.
- We assign the 🟢 validation score when all commands in the example guide ran successfully.
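To make the speedup convention concrete, here is a small worked example with made-up measurements: because a lower TTFT is better, its "performance" is the reciprocal of the latency, so the TTFT speedup divides the baseline latency by the hybrid latency, while the TPS speedup divides hybrid throughput by baseline throughput.

```python
# Worked example with hypothetical measurements (not from the tables).
baseline_ttft, hybrid_ttft = 10.0, 2.5  # seconds, lower is better
baseline_tps, hybrid_tps = 2.0, 16.4    # tokens/s, higher is better

print(f"TTFT speedup: {baseline_ttft / hybrid_ttft:.1f}x")  # 4.0x
print(f"TPS speedup:  {hybrid_tps / baseline_tps:.1f}x")    # 8.2x
```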
Alternate Flows#
Note
The alternate flows for LLMs described below are currently in the Early Access stage. Early Access features are still undergoing optimization and fine-tuning; they are not in their final form and may change as we work to mature them into full-fledged features.
OGA-based Flow with NPU-only Execution#
The primary OGA-based flow for LLMs employs a hybrid execution mode that leverages both the NPU and the iGPU. AMD also supports an OGA-based flow in which the iGPU is not used and the compute-intensive operations are offloaded exclusively to the NPU.
The OGA-based NPU-only execution mode is supported on STX and KRK platforms.
To get started with the OGA-based NPU-only execution mode, follow these instructions: OGA NPU Execution Mode.
PyTorch-based Flow#
An experimental flow based on PyTorch is available here: amd/RyzenAI-SW
This flow provides functional support for a broad set of LLMs. It is intended for prototyping and experimentation only; it is not optimized for performance and should not be used for benchmarking.
The PyTorch-based flow is supported on PHX, HPT, and STX platforms.