Preparing OGA Models#
This section describes the process for preparing LLMs for deployment on a Ryzen AI PC using the hybrid or NPU-only execution mode. Currently, the flow supports only fine-tuned versions of the models already supported (as listed on the OnnxRuntime GenAI (OGA) Flow page). For example, fine-tuned versions of Llama2 or Llama3 can be used. Model families with architectures not supported by the hybrid flow cannot be used.
For fine-tuned models that introduce architectural changes requiring new operator shapes not available in the Ryzen AI runtime, refer to the Compiling Operators for OGA Models page.
Preparing an LLM for deployment on a Ryzen AI PC involves two steps:
Quantization: The pretrained model is quantized to reduce its memory footprint and to map it more efficiently onto the compute resources of the hardware accelerators.
Postprocessing: The quantized model is exported to the OGA format and then postprocessed for the NPU-only or hybrid execution mode to obtain the final deployable model.
Quantization#
Prerequisites#
A Linux machine with AMD GPUs (e.g., AMD Instinct MI series) or NVIDIA GPUs
Setup#
Create and activate a Conda environment:
conda create --name <conda_env_name> python=3.11
conda activate <conda_env_name>
If using AMD GPUs, update PyTorch to use ROCm:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1
python -c "import torch; print(torch.cuda.is_available())" # Must return `True`
Download AMD Quark 0.10 and unzip the archive. Install Quark:
cd <extracted amd quark-version>
pip install amd_quark-<version>+<>.whl
Install other dependencies:
pip install datasets
pip install transformers
pip install accelerate
pip install evaluate
pip install nltk
Some models may require a specific version of transformers. For example, ChatGLM3 requires version 4.44.0.
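For example, to pin the version required by ChatGLM3:
pip install transformers==4.44.0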
Generate Quantized Model#
Use the following command to run quantization. On a GPU-equipped Linux machine, quantization takes about 30 to 60 minutes.
cd examples/torch/language_modeling/llm_ptq/
python quantize_quark.py \
--no_trust_remote_code \
--model_dir "meta-llama/Llama-2-7b-chat-hf" \
--output_dir <quantized safetensor output dir> \
--quant_scheme w_uint4_per_group_asym \
--group_size 128 \
--num_calib_data 128 \
--seq_len 512 \
--quant_algo awq \
--dataset pileval_for_awq_benchmark \
--model_export hf_format \
--data_type <datatype> \
--exclude_layers []
Use --data_type bfloat16 for a bf16 pretrained model. For an fp32/fp16 pretrained model, use --data_type float16.
The quantized model is generated in the <quantized safetensor output dir> folder.
Note: For the Phi-4 model, the following quantization recipe is recommended for better accuracy:
Use --quant_algo gptq
Add --group_size_per_layer lm_head 32
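Putting the recipe together, a minimal sketch of the Phi-4 invocation is shown below. It simply reuses the base command above with the two recommended changes applied; the Hugging Face model identifier and the dataset choice are assumptions and may need adjustment for your setup.
# Sketch only: base command from above with the Phi-4 recipe applied
python quantize_quark.py \
    --no_trust_remote_code \
    --model_dir "microsoft/phi-4" \
    --output_dir <quantized safetensor output dir> \
    --quant_scheme w_uint4_per_group_asym \
    --group_size 128 \
    --num_calib_data 128 \
    --seq_len 512 \
    --quant_algo gptq \
    --group_size_per_layer lm_head 32 \
    --dataset pileval_for_awq_benchmark \
    --model_export hf_format \
    --data_type <datatype> \
    --exclude_layers []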
Note: Currently, the following files are not copied into the quantized model folder and must be copied manually:
For Phi-4 models:
configuration_phi3.py
For ChatGLM-6b models:
tokenizer.json
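For example, assuming the original model checkpoint directory is still available on the Linux machine, the missing files can be copied as follows (paths are placeholders):
# Phi-4: copy the custom configuration script from the original checkpoint
cp <original model dir>/configuration_phi3.py <quantized safetensor output dir>/
# ChatGLM-6b: copy the tokenizer file from the original checkpoint
cp <original model dir>/tokenizer.json <quantized safetensor output dir>/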
Postprocessing#
Copy the quantized model to the Windows PC with Ryzen AI installed, activate the Ryzen AI Conda environment, and install the required dependencies:
conda activate ryzen-ai-<version>
pip install onnx_ir
pip install torch==2.7.1
Generate the final model for Hybrid execution mode:
conda activate ryzen-ai-<version>
model_generate --hybrid <output_dir> <quantized_model_path>
Generate the final model for NPU execution mode:
conda activate ryzen-ai-<version>
model_generate --npu <output_dir> <quantized_model_path> --optimize decode
Generate the final model for hybrid execution mode (prefill-fused version):
conda activate ryzen-ai-<version>
model_generate --hybrid <output_dir> <quantized_model_path> --optimize prefill
Prefill-fused hybrid models are only supported for Phi-3.5-mini-instruct and Mistral-7B-Instruct-v0.2.
Edit genai_config.json with the following entries:
"decoder": { "session_options": { "log_id": "onnxruntime-genai", "custom_ops_library": "onnx_custom_ops.dll", "external_data_file": "token.pb.bin", "custom_allocator": "ryzen_mm", "config_entries": { "dd_cache": "", "hybrid_opt_token_backend": "gpu", "hybrid_opt_max_seq_length": "4096", "max_length_for_kv_cache": "4096" }, "provider_options": [] }, "filename": "fusion.onnx",
Note: During the model_generate step, the quantized model is first converted to an OGA model using the ONNX Runtime GenAI Model Builder (version 0.9.2). It is also possible to export the OGA model in a standalone environment; refer to the official ONNX Runtime GenAI Model Builder documentation for details. Once you have an exported OGA model, you can pass it directly to the model_generate command, which skips the export step and performs only the post-processing.
Here are the basic commands to export an OGA model from the quantized model using a standalone environment:
conda create --name oga_builder_env python=3.10
conda activate oga_builder_env
pip install onnxruntime-genai==0.9.2
# pip install other necessary packages
pip install ....
python3 -m onnxruntime_genai.models.builder -m <input quantized model> -o <output OGA model> -p int4 -e dml
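As noted above, the exported OGA model can then be passed directly to model_generate, which skips the export step and performs only the post-processing. For example, for the hybrid execution mode:
conda activate ryzen-ai-<version>
model_generate --hybrid <output_dir> <output OGA model>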