Overview#
LLM Deployment on Ryzen AI#
Large Language Models (LLMs) can be deployed on Ryzen AI PCs with NPU and GPU acceleration. NPU-only and Hybrid execution modes, which utilize both the NPU and integrated GPU (iGPU), are supported via ONNXRuntime GenAI (OGA). GPU-only acceleration is enabled through llama.cpp. See the LLM Execution Mode Comparison below for detailed information.
Execution Modes#
Mode |
Framework(s) |
Compute Allocation |
Primary Use Case |
---|---|---|---|
NPU-Only |
OnnxRuntime GenAI (OGA) |
Neural Processing Unit (NPU) exclusive |
Maximum NPU utilization while preserving iGPU for parallel workloads |
Hybrid |
OnnxRuntime GenAI (OGA) |
Dynamic NPU + iGPU partitioning |
Interactive inference with optimal prefill/decode performance |
GPU |
llama.cpp |
Dedicated GPU execution |
High-throughput inference on discrete/integrated GPU |
CPU |
OGA or llama.cpp |
Traditional CPU-based inference |
Baseline compatibility across all processor generations |
Hardware Requirements#
Processor Series |
NPU-Only |
Hybrid |
GPU/CPU |
---|---|---|---|
Ryzen AI 300 (STX/KRK) |
✓ |
✓ |
✓ |
Ryzen AI 7000/8000 |
✗ |
✗ |
✓ |
Development Interfaces#
The Ryzen AI LLM software stack is available through three development interfaces, each suited for specific use cases as outlined in the sections below. All three interfaces are built on top of native OnnxRuntime GenAI (OGA) libraries or llama.cpp libraries, as shown in the Ryzen AI Software Stack diagram below.
The high-level Python APIs, as well as the Server Interface, also leverage the Lemonade SDK, which is multi-vendor open-source software that provides everything necessary for quickly getting started with LLMs on OGA or llama.cpp.
A key benefit of Lemonade is that software developed against their interfaces is portable to many other execution backends.
Your Python Application |
Your LLM Stack |
Your Native Application |
---|---|---|
* indicates open-source software (OSS).
Server Interface (REST API)#
The Server Interface provides a convenient means to integrate with applications that:
Already support an LLM server interface, such as the Ollama server or OpenAI API.
Are written in any language (C++, C#, Javascript, etc.) that supports REST APIs.
Benefits from process isolation for the LLM backend.
Lemonade Server is available in two ways:
Standalone Windows GUI installer: Quick setup with a desktop shortcut for immediate use. (Recommended for end users, see Server Interface (REST API))
Full Lemonade SDK: Complete development toolkit with server interface included. (Recommended for developers, see High-Level Python SDK for Python SDK)
For example applications that have been tested with Lemonade Server, see the Lemonade Server Examples.
High-Level Python SDK#
The high-level Python SDK, Lemonade, allows you to get started using PyPI installation in approximately 5 minutes.
This SDK allows you to:
Experiment with models in hybrid or NPU-only execution mode on Ryzen AI hardware.
Validate inference speed and task performance.
Integrate with Python apps using a high-level API.
To get started in Python, follow these instructions: High-Level Python SDK.
OGA APIs for C++ Libraries and Python#
Native C++ libraries for OGA are available to give full customizability for deployment into native applications. The Python bindings for OGA also provide a customizable interface for Python development.
To get started with the OGA APIs, follow these instructions: OnnxRuntime GenAI (OGA) Flow.
Supported LLMs#
The comprehensive set of pre-optimized models for hybrid execution are available in the AMD hybrid collection on Hugging Face and the NPU-only examples are available in the AMD NPU collection on Hugging Face. It is also possible to run fine-tuned versions of the models listed (for example, fine-tuned versions of Llama2 or Llama3). For instructions on how to prepare a fine-tuned OGA model, refer to Preparing OGA Models.
End to End OGA Validation#
A Jupyter Notebook example is provided to demonstrate end-to-end validation of OGA hybrid and NPU-only execution. This notebook includes:
Installation
Command Syntax
Benchmarking
Subjective Evaluation
Objective Evaluation
To run the notebook, visit the Lemonade Tools Tutorial.