OGA Flow for Hybrid Execution of LLMs#
Note
Support for LLMs is currently in the Early Access stage. Early Access features are still undergoing optimization and fine-tuning; they are not in their final form and may change as we continue working to mature them into full-fledged features.
Starting with version 1.3, the Ryzen AI Software includes support for deploying LLMs on Ryzen AI PCs using the ONNX Runtime generate() API (OGA). This documentation is for the Hybrid execution mode of LLMs, which leverages both the NPU and GPU.
Supported Configurations#
The Ryzen AI OGA flow supports the following processors running Windows 11:
Strix (STX): AMD Ryzen™ AI 9 HX 375, Ryzen AI 9 HX 370, Ryzen AI 9 365
Note: Phoenix (PHX) and Hawk (HPT) processors are not supported.
Requirements#
NPU Drivers (version .237): install according to the instructions at https://ryzenai.docs.amd.com/en/latest/inst.html
Ryzen AI 1.3 MSI installer
Hybrid LLM artifacts package: hybrid-llm-artifacts_1.3.0.zip, available from https://account.amd.com/en/member/ryzenai-sw-ea.html
Setting performance mode (Optional)#
To run the LLMs in the best performance mode, follow these steps:
Go to Windows → Settings → System → Power and set the power mode to Best Performance.
Execute the following commands in the terminal:
cd C:\Windows\System32\AMD
xrt-smi configure --pmode performance
Package Contents#
The Hybrid LLM artifacts package contains the files required to build and run applications that use the ONNX Runtime generate() API (OGA) to deploy LLMs in the Hybrid execution mode. The list below describes which files are needed for the different use cases:
Python flow
onnx_utils\bin\onnx_custom_ops.dll
onnxruntime_genai\wheel\onnxruntime_genai_directml-0.4.0.dev0-cp310-cp310-win_amd64.whl
onnxruntime_genai\benchmark\DirectML.dll
C++ Runtime
onnx_utils\bin\onnx_custom_ops.dll
onnxruntime_genai\benchmark\DirectML.dll
onnxruntime_genai\benchmark\D3D12Core.dll
onnxruntime_genai\benchmark\onnxruntime.dll
onnxruntime_genai\benchmark\ryzenai_onnx_utils.dll
C++ Dev headers
onnx_utils
onnxruntime_genai
Examples
Pre-optimized Models#
AMD provides a set of pre-optimized LLMs ready to be deployed with the Ryzen AI Software and the supporting runtime for hybrid execution. These models can be found on Hugging Face at the following links:
https://huggingface.co/amd/Phi-3-mini-4k-instruct-awq-g128-int4-asym-fp16-onnx-hybrid
https://huggingface.co/amd/Phi-3.5-mini-instruct-awq-g128-int4-asym-fp16-onnx-hybrid
https://huggingface.co/amd/Mistral-7B-Instruct-v0.3-awq-g128-int4-asym-fp16-onnx-hybrid
https://huggingface.co/amd/Qwen1.5-7B-Chat-awq-g128-int4-asym-fp16-onnx-hybrid
https://huggingface.co/amd/chatglm3-6b-awq-g128-int4-asym-fp16-onnx-hybrid
https://huggingface.co/amd/Llama-2-7b-hf-awq-g128-int4-asym-fp16-onnx-hybrid
https://huggingface.co/amd/Llama-2-7b-chat-hf-awq-g128-int4-asym-fp16-onnx-hybrid
https://huggingface.co/amd/Llama-3-8B-awq-g128-int4-asym-fp16-onnx-hybrid/tree/main
https://huggingface.co/amd/Llama-3.1-8B-awq-g128-int4-asym-fp16-onnx-hybrid/tree/main
https://huggingface.co/amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid
https://huggingface.co/amd/Llama-3.2-3B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid
The steps for deploying the pre-optimized models using Python or C++ are described in the following sections.
Hybrid Execution of OGA Models using Python#
Setup#
Install Ryzen AI 1.3 according to the instructions: https://ryzenai.docs.amd.com/en/latest/inst.html
Download and unzip the hybrid LLM artifacts package
Activate the Ryzen AI 1.3 Conda environment:
conda activate ryzen-ai-1.3.0
Install the wheel file included in the hybrid-llm-artifacts package:
cd path_to\hybrid-llm-artifacts\onnxruntime_genai\wheel
pip install onnxruntime_genai_directml-0.4.0.dev0-cp310-cp310-win_amd64.whl
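Optionally, confirm the wheel installed correctly by verifying that the package imports in the active environment:
python -c "import onnxruntime_genai"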
Run Models#
Clone the model from its Hugging Face repository and switch to the model directory.
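For example, assuming git and Git LFS are installed (the Llama-3.2-1B-Instruct model from the list above is used here for illustration):
git lfs install
git clone https://huggingface.co/amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid
cd Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid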
Open the genai_config.json file located in the folder of the downloaded model. Update the value of the "custom_ops_library" key with the full path to the onnx_custom_ops.dll located in the hybrid-llm-artifacts\onnx_utils\bin folder:
"session_options": {
...
"custom_ops_library": "path_to\\hybrid-llm-artifacts\\onnx_utils\\bin\\onnx_custom_ops.dll",
...
}
Copy the DirectML.dll file to the folder where the onnx_custom_ops.dll is located (note: this step is only required on some systems):
copy hybrid-llm-artifacts\onnxruntime_genai\lib\DirectML.dll hybrid-llm-artifacts\onnx_utils\bin
Run the LLM
cd hybrid-llm-artifacts\scripts\llama3
python run_model.py --model_dir path_to\Meta-Llama-3-8B-awq-w-int4-asym-gs128-a-fp16-onnx-ryzen-strix-hybrid
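The run_model.py script wraps the ONNX Runtime generate() API. The sketch below shows how the same API can be called directly from a custom Python script; the model path, prompt, and search options are illustrative, and the actual run_model.py implementation may differ:
import onnxruntime_genai as og

# Folder containing the hybrid model; its genai_config.json must already point
# to the full path of onnx_custom_ops.dll (see the step above).
model_dir = r"path_to\Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid"

model = og.Model(model_dir)
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode("What is the capital of France?")

# Generate and print the response token by token
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()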
Hybrid Execution of OGA Models using C++#
Setup#
Download and unzip the hybrid LLM artifacts package.
Copy the required library files from onnxruntime_genai\lib to examples\c\lib:
copy onnxruntime_genai\lib\*.* examples\c\lib\
Copy onnx_utils\bin\ryzenai_onnx_utils.dll to examples\c\lib:
copy onnx_utils\bin\ryzenai_onnx_utils.dll examples\c\lib\
Copy the required header files from onnxruntime_genai\include to examples\c\include:
copy onnxruntime_genai\include\*.* examples\c\include\
Build the model_benchmark.exe application:
cd hybrid-llm-artifacts\examples\c
cmake -G "Visual Studio 17 2022" -A x64 -S . -B build
cd build
cmake --build . --config Release
Note: The model_benchmark.exe executable is generated in the hybrid-llm-artifacts\examples\c\build\Release folder.
Run Models#
The model_benchmark.exe test application serves two purposes:
It provides a very simple mechanism for running and evaluating Hybrid OGA models
The source code for this application provides a reference implementation for how to integrate Hybrid OGA models in custom C++ programs
To evaluate models using the model_benchmark.exe test application:
# To see settings info
.\model_benchmark.exe -h
# To run with default settings
.\model_benchmark.exe -i $path_to_model_dir -f $prompt_file -l $list_of_prompt_lengths
# To show more informational output
.\model_benchmark.exe -i $path_to_model_dir -f $prompt_file --verbose
# To run with given number of generated tokens
.\model_benchmark.exe -i $path_to_model_dir -f $prompt_file -l $list_of_prompt_lengths -g $num_tokens
# To run with given number of warmup iterations
.\model_benchmark.exe -i $path_to_model_dir -f $prompt_file -l $list_of_prompt_lengths -w $num_warmup
# To run with given number of iterations
.\model_benchmark.exe -i $path_to_model_dir -f $prompt_file -l $list_of_prompt_lengths -r $num_iterations
For example:
cd hybrid-llm-artifacts\examples\c\build\Release
.\model_benchmark.exe -i <path_to>/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid -f <path_to>/prompt.txt -l "128, 256, 512, 1024, 2048" --verbose
Preparing OGA Models for Hybrid Execution#
This section describes the process for preparing LLMs for deployment on a Ryzen AI PC using the hybrid execution mode. Currently, the flow supports only fine-tuned versions of the model architectures already supported by the hybrid flow (as listed in the "Pre-optimized Models" section of this guide). For example, fine-tuned versions of Llama 2 or Llama 3 can be used, but model families with architectures not supported by the hybrid flow cannot.
Preparing an LLM for deployment on a Ryzen AI PC using the hybrid execution mode involves three steps:
Quantizing the model: The pretrained model is quantized to reduce its memory footprint and to better map to the compute resources of the hardware accelerators.
Generating the OGA model: A model suitable for use with the ONNX Runtime generate() API (OGA) is generated from the quantized model.
Generating the final model for Hybrid execution: A model specialized for the hybrid execution mode is generated from the OGA model.
Quantizing the model#
Prerequisites#
A Linux machine with AMD or NVIDIA GPUs
Setup#
Create a Conda environment:
conda create --name <conda_env_name> python=3.11
conda activate <conda_env_name>
If using AMD GPUs, update PyTorch to use ROCm:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1
python -c "import torch; print(torch.cuda.is_available())" # Must return `True`
Download Quark 0.6.0 and unzip the archive.
Install Quark:
cd <extracted quark 0.6.0>
pip install quark-0.6.0+<>.whl
Perform quantization#
The model is quantized using the following command and quantization settings:
cd examples/torch/language_modeling/llm_ptq/
python3 quantize_quark.py \
    --model_dir "meta-llama/Llama-2-7b-chat-hf" \
    --output_dir <quantized safetensor output dir> \
    --quant_scheme w_uint4_per_group_asym \
    --num_calib_data 128 \
    --quant_algo awq \
    --dataset pileval_for_awq_benchmark \
    --seq_len 512 \
    --model_export quark_safetensors \
    --data_type float16 \
    --exclude_layers [] \
    --custom_mode awq
The quantized model is generated in the <quantized safetensor output dir> folder.
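To sanity-check the export before moving on, the sketch below lists the tensors in the generated safetensors file so you can confirm the int4 weights, scales, and zero points are present. It assumes the safetensors package is installed and that the export produced a single model.safetensors file; the filename may differ depending on the Quark version:
# Sketch: inspect the Quark safetensors export (filename is an assumption).
from safetensors import safe_open

export_path = "<quantized safetensor output dir>/model.safetensors"  # adjust to the actual export
with safe_open(export_path, framework="pt") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        print(name, tensor.dtype, tuple(tensor.shape))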
Generating the OGA model#
Setup#
Clone the onnxruntime-genai repo:
git clone --branch v0.5.1 https://github.com/microsoft/onnxruntime-genai.git
Install the required packages:
conda create --name oga_051 python=3.11
conda activate oga_051
pip install numpy
pip install onnxruntime-genai
pip install onnx
pip install transformers
pip install torch
pip install sentencepiece
Build the OGA Model#
Run the OGA model builder utility as shown below:
cd onnxruntime-genai/src/python/py/models
python builder.py \
-i <quantized safetensor model dir> \
-o <oga model output dir> \
-p int4 \
-e dml
The OGA model is generated in the <oga model output dir> folder.
Generating the final model#
Setup#
Create and activate the postprocessing environment:
conda create -n oga_to_hybrid python=3.10
conda activate oga_to_hybrid
Install the wheels:
cd <hybrid package>\preprocessing
pip install ryzenai_dynamic_dispatch-1.1.0.dev0-cp310-cp310-win_amd64.whl
pip install ryzenai_onnx_utils-0.5.0-py3-none-any.whl
pip install onnxruntime
Generate the final model#
The commands below use the Phi-3-mini-4k-instruct model (denoted as Phi-3-mini-4k for brevity) as an example to demonstrate the steps for generating the final model.
Generate the Raw model:
cd <oga dml model folder>
mkdir tmp
onnx_utils --external-data-extension "onnx.data" partition model.onnx ./tmp hybrid_llm.yaml -v --save-as-external --model-name Phi-3-mini-4k_raw
The command generates:
tmp/Phi-3-mini-4k_raw.onnx
tmp/Phi-3-mini-4k_raw.onnx.data
Post-process the raw model to generate the JIT model:
onnx_utils postprocess .\tmp\Phi-3-mini-4k_raw.onnx .\tmp\Phi-3-mini-4k_jit.onnx hybrid_llm --script-options jit_npu
The command generates:
Phi-3-mini-4k_jit.bin
Phi-3-mini-4k_jit.onnx
Phi-3-mini-4k_jit.onnx.data
Phi-3-mini-4k_jit.pb.bin
Move the files related to the JIT model (.bin, .onnx, .onnx.data, and .pb.bin) to the original model directory and remove the tmp folder.
Remove the original model.onnx and model.onnx.data files.
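These file operations can be performed with standard commands, for example (a sketch run from a Windows command prompt inside the model folder, using the Phi-3-mini-4k naming from above):
move tmp\Phi-3-mini-4k_jit.* .
rmdir /s /q tmp
del model.onnx model.onnx.data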
Open genai_config.json and change the contents of the file as shown below:
Before
"session_options": {
"log_id": "onnxruntime-genai",
"provider_options": [
{
"dml": {}
}
]
},
"filename": "model.onnx",
Modified
"session_options": {
"log_id": "onnxruntime-genai",
"custom_ops_library": "onnx_custom_ops.dll",
"custom_allocator": "shared_d3d_xrt",
"external_data_file": "Phi-3-mini-4k_jit.pb.bin",
"provider_options": [
]
},
"filename": "Phi-3-mini-4k_jit.onnx",
The final model is now ready and can be tested with the model_benchmark.exe test application.