OnnxRuntime GenAI (OGA) Flow#
Ryzen AI Software supports deploying LLMs on Ryzen AI PCs using the native ONNX Runtime Generate (OGA) C++ or Python API. The OGA API is the lowest-level API available for building LLM applications on a Ryzen AI PC. It supports the following execution modes:
Hybrid execution mode: This mode uses both the NPU and iGPU to achieve the best time-to-first-token (TTFT) during the prefill phase and the best tokens per second (TPS) during the decode phase.
NPU-only execution mode: This mode uses the NPU exclusively for both the prefill and decode phases. Two types of NPU models are available:
Token Fusion models: Support long context up to 16K tokens with no additional configuration required.
Full Fusion models: Optimized for best performance, supporting up to 4096 total tokens (input + output).
Supported Configurations#
The Ryzen AI OGA flow supports Strix and Krackan Point processors. Phoenix (PHX) and Hawk Point (HPT) processors are not supported.
Requirements#
Install the NPU drivers and the Ryzen AI MSI installer. See the Installation Instructions for more details.
Install the GPU device driver: ensure the AMD GPU driver is installed (https://www.amd.com/en/support).
Install Git for Windows (needed to download models from Hugging Face): https://git-scm.com/downloads
Pre-optimized Models#
AMD provides a set of pre-optimized LLMs ready to be deployed with Ryzen AI Software and the supporting runtime for hybrid and/or NPU-only execution. These include popular architectures such as Llama-2, Llama-3, Mistral, DeepSeek Distill models, Qwen-2, Qwen-2.5, Qwen-3, Gemma-2, Gemma-3, GPT-OSS, Phi-3, Phi-3.5, and Phi-4.
Hugging Face collection of hybrid models: https://huggingface.co/collections/amd/ryzen-ai-171-hybrid
Hugging Face collection of NPU Token Fusion models: https://huggingface.co/collections/amd/ryzen-ai-171-npu-16k
Hugging Face collection of NPU Full Fusion models: https://huggingface.co/collections/amd/ryzen-ai-171-npu-4k
NPU Models: Token Fusion vs Full Fusion#
AMD provides two types of NPU models:
Token Fusion models: These models support long context up to 16K tokens. They are pre-built and uploaded to Hugging Face — no additional configuration is required to use long context. Simply download and run the model.
Full Fusion models: These models are optimized for best inference performance but do not support long context. The total token count (input + output) must not exceed 4096.
Choose the model type based on your use case: Token Fusion for long context workloads, or Full Fusion for maximum throughput on shorter sequences.
Each OGA model folder contains a genai_config.json file, which holds the configuration settings for the model. The session_options section is where specific runtime dependencies are specified.
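For illustration, the session_options block of a hybrid model typically has the following shape. This is an abridged sketch reusing the fields shown in the long-context example later on this page; the exact fields and values vary per model.
{
  "model": {
    "decoder": {
      "session_options": {
        "log_id": "onnxruntime-genai",
        "provider_options": [
          {
            "RyzenAI": {
              "external_data_file": "model_jit.pb.bin"
            }
          }
        ]
      }
    }
  }
}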
Changes Compared to Previous Release#
OGA version is updated to v0.11.2 (Ryzen AI 1.7) from v0.9.2.2 (Ryzen AI 1.6.1).
For the 1.7 release, a new set of hybrid and NPU models has been published. Models from earlier releases are not compatible with this version. If you are using Ryzen AI 1.7, please download the updated models.
Two types of NPU models are now available: Token Fusion models (long context up to 16K tokens) and Full Fusion models (best performance, up to 4096 tokens).
Context length up to 4K tokens (combined input and output) is supported for Full Fusion NPU models. Extended context length up to 16K tokens is supported for Token Fusion NPU models and Hybrid models.
Compatible OGA APIs#
Pre-optimized hybrid or NPU LLMs can be executed using the official OGA C++ and Python APIs. The current release is compatible with OGA version 0.11.2. For detailed documentation and examples, refer to the official OGA repository: 🔗 microsoft/onnxruntime-genai
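As a quick orientation, the following minimal Python sketch shows the core OGA generation loop, assuming the 0.11.x Python API. The model folder name and prompt are placeholders to adapt to your setup.
import onnxruntime_genai as og

# Load the pre-optimized model from its folder (contains genai_config.json)
model = og.Model("Llama-2-7b-chat-hf-onnx-ryzenai-hybrid")
tokenizer = og.Tokenizer(model)

# Cap the combined input + output token length
params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("What is the Ryzen AI NPU?"))

# Decode token by token until EOS or max_length is reached
while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))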
LLM Test Programs#
The Ryzen AI installation includes test programs (in C++ and Python) that can be used to run LLMs and understand how to integrate them into your application.
The steps for deploying the pre-optimized models using the sample programs are described in the following sections.
Steps to Run the C++ Program and Sample Python Script#
(Optional) Enable Performance Mode
To run LLMs in best performance mode, follow these steps:
Go to Windows → Settings → System → Power, and set the power mode to Best Performance.
Open a terminal and run:
cd C:\Windows\System32\AMD
xrt-smi configure --pmode performance
Activate the Ryzen AI Conda environment and install the torch library.
Run the following commands:
conda activate ryzen-ai-<version>
pip install torch==2.7.1
This step is required for running the Python script.
Note
For the C++ program, if you choose not to activate the Conda environment, open a Windows Command Prompt and manually set the environment variable before continuing:
set RYZEN_AI_INSTALLATION_PATH=C:\Program Files\RyzenAI\<version>
C++ Program#
Use the model_benchmark.exe executable to test LLMs and identify DLL dependencies for C++ applications.
Set up a working directory and copy the required files
mkdir llm_run
cd llm_run
:: Copy the sample C++ executable
xcopy /Y "%RYZEN_AI_INSTALLATION_PATH%\LLM\example\model_benchmark.exe" .
:: Copy the sample prompt file
xcopy /Y "%RYZEN_AI_INSTALLATION_PATH%\LLM\example\amd_genai_prompt.txt" .
:: Copy required DLLs
xcopy /Y "%RYZEN_AI_INSTALLATION_PATH%\deployment\." .
Download the model from Hugging Face
:: Install Git LFS if you haven't already: https://git-lfs.com
git lfs install
:: Clone the model repository
git clone https://huggingface.co/amd/Llama-2-7b-chat-hf-onnx-ryzenai-hybrid
Run model_benchmark.exe
.\model_benchmark.exe -i <path_to_model_dir> -f <prompt_file> -l <list_of_prompt_lengths>
:: Example:
.\model_benchmark.exe -i Llama-2-7b-chat-hf-onnx-ryzenai-hybrid -f amd_genai_prompt.txt -l "1024"
Long Context Support#
Ryzen AI supports long context (beyond 4096 tokens) for Hybrid models and Token Fusion NPU models.
Token Fusion NPU Models#
Token Fusion NPU models are pre-built with long context support up to 16K tokens. No additional configuration is required — simply download the model from Hugging Face and run it.
:: Example: Clone a Token Fusion NPU model
git clone https://huggingface.co/amd/Phi-3.5-mini-instruct-onnx-ryzenai-npu
:: Run with long context
.\model_benchmark.exe -i <path_to_model_dir> -f amd_genai_prompt_long.txt -l "16000"
Hybrid Models#
If the total number of tokens exceeds 4096 for a hybrid model, follow the steps below.
Steps to run long context:
Make the following changes in the genai_config.json file:
Add "hybrid_opt_chunk_context": "1" under model.decoder.session_options.provider_options.RyzenAI.
{ "model": { "bos_token_id": 1, "context_length": 16384, "decoder": { "session_options": { "log_id": "onnxruntime-genai", "provider_options": [ { "RyzenAI": { "external_data_file": "model_jit.pb.bin", "hybrid_opt_free_after_prefill": "1", "hybrid_opt_max_seq_length": "4096", "hybrid_opt_chunk_context": "1" } } ] },
Add "chunk_size": 2048 under search.
"search": {
  "diversity_penalty": 0.0,
  "do_sample": false,
  "chunk_size": 2048,
  …
Copy the amd_genai_prompt_long.txt file into your working directory.
xcopy /Y "%RYZEN_AI_INSTALLATION_PATH%\LLM\example\amd_genai_prompt_long.txt" .
Run the model with model_benchmark.exe, using the amd_genai_prompt_long.txt prompt file.
.\model_benchmark.exe -i <path_to_model_dir> -f amd_genai_prompt_long.txt -l "16000"
Note
The sample test application model_benchmark.exe accepts -l for input token length and -g for output token length.
Full Fusion NPU models support up to 4096 tokens in total (input + output). By default, -g is set to 128. If the input length is close to 4096, you must adjust -g so that the sum of input and output tokens does not exceed 4096. For example, -l 4000 -g 96 is valid (4000 + 96 ≤ 4096), while -l 4000 -g 128 exceeds the limit and results in an error.
Token Fusion NPU models support long context up to 16K tokens (input + output) with no additional configuration.
Hybrid models: The combined number of input and output tokens must not exceed the model's context_length. You can verify the context_length in the genai_config.json file. For example, if a model's context_length is 8,000, the total token count (input + output) must not exceed 8,000.
The long context feature has been tested for Token Fusion NPU models and Hybrid models up to 16,000 tokens.
Python Script#
Navigate to your working directory and download the model.
:: Install Git LFS if you haven't already: https://git-lfs.com
git lfs install
:: Clone the model repository
git clone https://huggingface.co/amd/Llama-2-7b-chat-hf-onnx-ryzenai-hybrid
Run the sample Python script
python "%RYZEN_AI_INSTALLATION_PATH%\LLM\example\run_model.py" -m <model_folder> -l <max_length>
:: Example command
python "%RYZEN_AI_INSTALLATION_PATH%\LLM\example\run_model.py" -m "Llama-2-7b-chat-hf-onnx-ryzenai-hybrid" -l 256
Note
Some models may return non-printable characters in their output (for example, Qwen models), which can cause a crash while printing the output text. To avoid this, modify the provided script %RYZEN_AI_INSTALLATION_PATH%\LLM\example\run_model.py by adding a text sanitization function and updating the print statement as shown below.
Add sanitize_string function:
def sanitize_string(input_string):
    # Drop characters that cannot be represented in the console code page
    return input_string.encode("charmap", "ignore").decode("charmap")
Update line 80 to print sanitized output:
print("Output:", sanitize_string(output_text))
This sanitization fix will be included in the run_model.py script in the next release.
Python Script (with Chat Template)#
For models that use chat templates, the model_chat.py script provides better output quality by automatically loading and applying the chat template from the model folder during inference. The script also supports single-prompt testing, multi-turn context-cache testing, and interactive chat with timing output.
The script is included in the Ryzen AI installation:
:: Single prompt with timing
python "%RYZEN_AI_INSTALLATION_PATH%\LLM\example\model_chat.py" -m <model_folder> -pr amd_genai_prompt.txt --timings
:: Long context support (increase context window to e.g. 16k)
python "%RYZEN_AI_INSTALLATION_PATH%\LLM\example\model_chat.py" -m <model_folder> -pr amd_genai_prompt_long.txt -mpt 16000
:: Interactive chat
python "%RYZEN_AI_INSTALLATION_PATH%\LLM\example\model_chat.py" -m <model_folder>
For the full list of options including multi-turn JSON testing, guided generation, and advanced flags, refer to the RyzenAI-SW repository.
It is highly recommended to use model_chat.py for the GPT-OSS-20B NPU model.
Vision Language Model (VLM)#
AMD provides a pre-optimized Gemma-3-4b-it multimodal model ready to be deployed with Ryzen AI Software. Support for this model is available starting with the Ryzen AI 1.7 release.
Model: Gemma-3-4b-it-mm-onnx-ryzenai-npu
VLM inference requires dedicated Python scripts, which are included in the Ryzen AI installation at %RYZEN_AI_INSTALLATION_PATH%\LLM\example\vlm.
Quick Inference#
Use vlm_run.py to quickly test a model and see output:
:: Basic inference
python "%RYZEN_AI_INSTALLATION_PATH%\LLM\example\vlm\vlm_run.py" -m <model_folder> -i <image_path>
:: Custom prompt
python "%RYZEN_AI_INSTALLATION_PATH%\LLM\example\vlm\vlm_run.py" -m <model_folder> -i <image_path> -p "What's in this image?"
:: Resize image before running
python "%RYZEN_AI_INSTALLATION_PATH%\LLM\example\vlm\vlm_run.py" -m <model_folder> -i <image_path> --image_size 1024 1024
For benchmarking scripts (vlm_benchmark.py, run_all_benchmarks.py) and detailed options, refer to the README in the vlm directory or the RyzenAI-SW repository.
Building C++ Applications#
A complete example including C++ source and build instructions is available in the RyzenAI-SW repository: amd/RyzenAI-SW
Using Fine-Tuned Models#
It is also possible to run fine-tuned versions of the pre-optimized OGA models.
To do this, the fine-tuned models must first be prepared for execution with the OGA flow. For instructions on how to do this, refer to the page about Preparing OGA Models.
After a fine-tuned model has been prepared for execution, it can be deployed by following the steps described previously on this page.
Running LLMs via pip install#
In addition to the full Ryzen AI software stack, standalone wheel files are provided for users who prefer to use their own environment. To prepare an environment for running Hybrid and NPU-only LLMs independently, perform the following steps:
Create a new Python environment and activate it.
conda create -n <env_name> python=3.12 -y
conda activate <env_name>
Install the onnxruntime-genai and model-generate wheel files.
pip install onnxruntime-genai-directml-ryzenai==0.11.2 --extra-index-url https://pypi.amd.com/ryzenai_llm/1.7.1/windows/simple/
pip install model-generate==1.7.1 --extra-index-url https://pypi.amd.com/ryzenai_llm/1.7.1/windows/simple/
Navigate to your working directory and download the desired Hybrid/NPU model.
cd working_directory
git clone <link_to_model>
Run the Hybrid or NPU model.
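For example, the model can be run directly through the OGA Python API. The following is a minimal sketch with token streaming, assuming the hybrid Llama-2 model cloned in the earlier examples; adapt the model folder and prompt to your setup.
import onnxruntime_genai as og

model = og.Model("Llama-2-7b-chat-hf-onnx-ryzenai-hybrid")
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Hello! What can you do?"))

# Print each token as soon as it is generated
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)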