Preparing OGA Models#
This section describes the process for preparing LLMs for deployment on a Ryzen AI PC using the hybrid or NPU-only execution mode. Currently, the flow supports only fine-tuned versions of the models already supported (as listed in OnnxRuntime GenAI (OGA) Flow page). For example, fine-tuned versions of Llama2 or Llama3 can be used. However, different model families with architectures not supported by the hybrid flow cannot be used.
For fine-tuned models that introduce architectural changes requiring new operator shapes not available in the Ryzen AI runtime, refer to the Compiling Operators for OGA Models
Preparing a LLM for deployment on a Ryzen AI PC involves 2 steps:
Quantization: The pretrained model is quantized to reduce memory footprint and better map to compute resources in the hardware accelerators
Postprocessing: During the postprocessing the model is exported to OGA followed by NPU-only or Hybrid execution mode specific postprocess to obtain the final deployable model.
Quantization#
Prerequisites#
Linux machine with AMD (e.g., AMD Instinct MI Series) or Nvidia GPUs
Setup#
Create and activate Conda Environment
conda create --name <conda_env_name> python=3.12
conda activate <conda_env_name>
If Using AMD GPUs, update PyTorch to use ROCm
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1
python -c "import torch; print(torch.cuda.is_available())" # Must return `True`
Download
AMD Quark 0.11and unzip the archiveInstall Quark:
cd <extracted amd quark-version>
pip install amd_quark-<version>+<>.whl
Install other dependencies
pip install datasets
pip install transformers==4.57.6
pip install accelerate
pip install evaluate
pip install nltk
Generate Quantized Model#
Use following command to run Quantization. In a GPU equipped Linux machine the quantization can take about 30-60 minutes.
cd examples/torch/language_modeling/llm_ptq/
python quantize_quark.py \
--no_trust_remote_code \
--model_dir "meta-llama/Llama-2-7b-chat-hf" \
--output_dir <quantized safetensor output dir> \
--quant_scheme uint4_wo_128 \
--num_calib_data 128 \
--seq_len 512 \
--quant_algo awq \
--dataset pileval_for_awq_benchmark \
--model_export hf_format \
--data_type <datatype> \
--exclude_layers []
Use
--data_type bfloat16for bf16 pretrained model. For fp32/fp16 pretrained model use--datatype float16Not using
--exclude_layersparameter may result in model-specific defaults which may exclude certain layers like output layers.To specify a group size other than 128, such as 32, use
--quant_scheme uint4_wo_32instead of--quant_scheme uint4_wo_128. Available group sizes are 32, 64, and 128 (e.g., uint4_wo_32, uint4_wo_64, uint4_wo_128)Quark supports quantizing layers with different group sizes, use
--layer_quant_scheme lm_head uint4_wo_32to quantize the model with 32 group size for lm_head
The quantized model is generated in the <quantized safetensor output dir> folder.
Note: For the Phi-4 model, the following quantization recipe is recommended for better accuracy:
Use
--quant_algo gptqAdd
--layer_quant_scheme lm_head uint4_wo_32
Note:: Currently the following files are not copied into the quantized model folder and must be copied manually:
For Phi-4 models:
configuration_phi3.pyFor ChatGLM-6b models:
tokenizer.json
Postprocessing#
Copy the quantized model to the Windows PC with Ryzen AI installed, and activate the Ryzen AI Conda environment.
conda activate ryzen-ai-<version>
Install the model-generate package:
pip install model-generate==1.7.1 --force-reinstall --no-deps --extra-index-url https://pypi.amd.com/ryzenai_llm/1.7.1/windows/simple/
Hybrid Execution Mode#
Generate the final model for Hybrid execution mode (NPU prefill phase + GPU token phase):
model_generate --hybrid --input <quantized_model_path> --output <output_dir>
NPU Execution Mode#
Several NPU optimization levels are available depending on model support and performance requirements.
Full Fusion (Best performance, recommended for supported models):
model_generate --npu --full_fusion --input <quantized_model_path> --output <output_dir>
Token Fusion (better tokens-per-second):
model_generate --npu --token_fusion --input <quantized_model_path> --output <output_dir>
Note: Token Fusion currently supports generating models with a 4K context length only. For longer context lengths (e.g., 16K), use the pre-built models available on Hugging Face.
Basic (safe default for new or untested models):
model_generate --npu --basic --input <quantized_model_path> --output <output_dir>
Eager:
model_generate --npu --eager --input <quantized_model_path> --output <output_dir>
OGA Export Only#
To export the quantized model to OGA format without performing any NPU or Hybrid postprocessing:
model_generate --oga_only --input <quantized_model_path> --output <output_dir>
Memory Optimization#
Add --mem_optimize to any recipe to optimize for 16 GB laptop configurations:
model_generate --hybrid --mem_optimize --input <quantized_model_path> --output <output_dir>
model_generate --npu --token_fusion --mem_optimize --input <quantized_model_path> --output <output_dir>
Note: During the model_generate step, the quantized model is first converted to an OGA model using ONNX Runtime GenAI Model Builder (version 0.11.2). It is possible to use a standalone environment for exporting an OGA model, refer to the official ONNX Runtime GenAI Model Builder documentation. Once you have an exported OGA model, you can pass it directly to the model_generate command with --input, which will skip the export step and perform only the post-processing.
Here are simple commands to export an OGA model from a quantized model using a standalone environment:
conda create --name oga_builder_env python=3.12
conda activate oga_builder_env
pip install onnxruntime-genai==0.11.2
# pip install other necessary packages
pip install ....
python3 -m onnxruntime_genai.models.builder -m <input quantized model> -o <output OGA model> -p int4 -e dml