Overview#

OGA-based Flow#

Ryzen AI Software supports deploying quantized 4-bit LLMs on Ryzen AI 300-series PCs using the OnnxRuntime GenAI (OGA) framework. OGA is a multi-vendor generative AI framework from Microsoft that provides a convenient LLM interface for execution backends such as Ryzen AI.

The flow supports two execution modes:

  • NPU-only execution mode: the compute-intensive operations are offloaded exclusively to the NPU. The iGPU is not used by the LLM and remains free for other tasks.

  • Hybrid execution mode: the model is optimally partitioned so that different operations are scheduled on the NPU or on the iGPU. This minimizes time-to-first-token (TTFT) in the prefill phase and maximizes token generation throughput (tokens per second, TPS) in the decode phase.
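The two metrics above follow directly from timestamps taken around the generation call. A minimal sketch of the definitions (plain Python, no Ryzen AI-specific code):

```python
def llm_metrics(start, first_token_time, end, num_generated_tokens):
    """Compute time-to-first-token (TTFT) and decode throughput (TPS).

    TTFT covers the prefill phase: from request start to the first token.
    TPS covers the decode phase: the remaining tokens over the remaining time.
    """
    ttft = first_token_time - start
    decode_time = end - first_token_time
    # Tokens after the first are produced during the decode phase.
    tps = (num_generated_tokens - 1) / decode_time if decode_time > 0 else float("inf")
    return ttft, tps

# Synthetic example: prefill takes 0.5 s, then 99 more tokens arrive
# over the next 2.0 s -> TTFT = 0.5 s, TPS = 49.5 tokens/s.
ttft, tps = llm_metrics(start=0.0, first_token_time=0.5, end=2.5,
                        num_generated_tokens=100)
```

Hybrid execution targets both numbers at once: prefill is compute-bound (TTFT), while decode is latency-bound per token (TPS).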

Supported Configurations#

  • Only Ryzen AI 300-series Strix Point (STX) and Krackan Point (KRK) processors support OGA-based hybrid execution.

  • Developers with Ryzen AI 7000- and 8000-series processors can get started using the CPU-based examples linked in the Featured LLMs table.

  • Windows 11 is the required operating system.

Development Interfaces#

The Ryzen AI LLM software stack is available through three development interfaces, each suited for specific use cases as outlined in the sections below. All three interfaces are built on top of native OnnxRuntime GenAI (OGA) libraries, as shown in the Ryzen AI Software Stack diagram below.

The high-level Python APIs, as well as the Server Interface, also leverage the Lemonade SDK, which is multi-vendor open-source software that provides everything necessary for quickly getting started with LLMs on OGA.

A key benefit of both OGA and Lemonade is that software developed against their interfaces is portable to many other execution backends.

In the Ryzen AI Software Stack diagram, * indicates open-source software (OSS).

High-Level Python SDK#

The high-level Python SDK, Lemonade, can be installed from PyPI, letting you get started in approximately 5 minutes.

This SDK allows you to:

  • Experiment with models in hybrid execution mode on Ryzen AI hardware.

  • Validate inference speed and task performance.

  • Integrate with Python apps using a high-level API.

To get started in Python, follow these instructions: High-Level Python SDK.
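As an illustration, a generation call through the Lemonade Python API might look like the sketch below. The `from_pretrained` entry point, the `"oga-hybrid"` recipe name, and the checkpoint name are assumptions based on recent Lemonade releases; consult the High-Level Python SDK instructions for the exact API and supported models.

```python
def run_hybrid_llm(prompt: str, max_new_tokens: int = 64) -> str:
    """Sketch: load a hybrid-execution OGA model via Lemonade and generate text.

    Requires a Ryzen AI 300-series PC and the lemonade-sdk package.
    The recipe and checkpoint names below are assumptions, not verified values.
    """
    # Deferred import so this module can be inspected without the SDK installed.
    from lemonade.api import from_pretrained

    # "oga-hybrid" selects the NPU + iGPU hybrid execution mode (assumed name).
    model, tokenizer = from_pretrained(
        "amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid",
        recipe="oga-hybrid",
    )
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    response = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(response[0])
```

The same code ports to other backends by changing the recipe, which is the portability benefit noted above.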

Server Interface (REST API)#

The Server Interface provides a convenient means to integrate with applications that:

  • Already support an LLM server interface, such as the Ollama server or OpenAI API.

  • Are written in any language (C++, C#, JavaScript, etc.) that supports REST APIs.

  • Benefit from process isolation for the LLM backend.

To get started with the server interface, follow these instructions: Server Interface (REST API).

For example applications that have been tested with Lemonade Server, see the Lemonade Server Examples.
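Because the server exposes an OpenAI-compatible REST API, any HTTP client can drive it. The sketch below assumes Lemonade Server defaults (localhost, port 8000, an OpenAI-style `/api/v1/chat/completions` route) and a placeholder model name; adjust these to your installation.

```python
import json
import urllib.request


def build_chat_request(model: str, user_message: str) -> dict:
    """Build an OpenAI-style chat-completions payload (pure data, no I/O)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
    }


def ask_server(payload: dict, base_url: str = "http://localhost:8000/api/v1") -> str:
    """POST the payload to a running Lemonade Server (assumed URL and port)."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Standard OpenAI-style response shape.
    return body["choices"][0]["message"]["content"]
```

Clients already written against the OpenAI API only need the `base_url` changed to point at the local server.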

OGA APIs for C++ Libraries and Python#

Native C++ libraries for OGA are available, giving full control when deploying LLMs into native applications.

The Python bindings for OGA also provide a customizable interface for Python development.

To get started with the OGA APIs, follow these instructions: OnnxRuntime GenAI (OGA) Flow.
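For reference, the OGA Python bindings follow a load / tokenize / decode-loop pattern. The sketch below is based on recent onnxruntime-genai releases; exact call names vary by version, and running it requires an exported OGA model directory and the Ryzen AI execution backend.

```python
def generate_with_oga(model_dir: str, prompt: str, max_length: int = 256) -> str:
    """Sketch of the OGA Python bindings generation loop.

    Requires the onnxruntime-genai package and a model folder produced
    for the target backend (e.g. NPU-only or hybrid execution).
    """
    # Deferred import: onnxruntime-genai is only needed at run time.
    import onnxruntime_genai as og

    model = og.Model(model_dir)          # model_dir holds the exported OGA model
    tokenizer = og.Tokenizer(model)

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)

    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(prompt))

    # Token-by-token decode loop until the generator signals completion.
    while not generator.is_done():
        generator.generate_next_token()

    return tokenizer.decode(generator.get_sequence(0))
```

The C++ libraries expose the same Model/Tokenizer/Generator objects, so this loop maps almost one-to-one onto a native application.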