Running LLM on Linux#
This page showcases an example of running an LLM on the Ryzen AI NPU.
Open a Linux terminal and create a new folder:
mkdir run_llm
cd run_llm
Choose any pre-quantized, post-processed, ready-to-run model from the Hugging Face collection of NPU models.
For this flow, “Phi-3.5-mini-instruct-onnx-ryzenai-npu” is used as the reference model.
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/amd/Phi-3.5-mini-instruct-onnx-ryzenai-npu
Locate the Ryzen AI installation path by inspecting the RYZEN_AI_INSTALLATION_PATH environment variable:
echo $RYZEN_AI_INSTALLATION_PATH
<TARGET-PATH>/ryzen_ai-1.6.1/venv
# Activate the virtual environment
source <TARGET-PATH>/ryzen_ai-1.6.1/venv/bin/activate
Collect the necessary files into the current working directory (a consolidated copy script is sketched after this list):
- Deployment folder - contains the libraries required to run the LLM model
# Navigate to <TARGET-PATH>/ryzen_ai-1.6.1/venv and copy the "deployment" folder
cp -r <TARGET-PATH>/ryzen_ai-1.6.1/venv/deployment .
- Model Benchmark Script
# Navigate to <TARGET-PATH>/ryzen_ai-1.6.1/venv/LLM/examples/ and copy "model_benchmark" file.
cp <TARGET-PATH>/ryzen_ai-1.6.1/venv/LLM/examples/model_benchmark .
- Prompt file - the input to your LLM model
# Navigate to <TARGET-PATH>/ryzen_ai-1.6.1/venv/LLM/examples/ and copy "amd_genai_prompt.txt" file.
cp <TARGET-PATH>/ryzen_ai-1.6.1/venv/LLM/examples/amd_genai_prompt.txt .
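As an alternative to the individual cp commands above, the three copy steps can be done with a short Python script. This is a minimal sketch, assuming RYZEN_AI_INSTALLATION_PATH points to the <TARGET-PATH>/ryzen_ai-1.6.1/venv directory shown earlier:

# Sketch: copy the deployment folder, benchmark binary, and prompt file
# into the current working directory. Assumes RYZEN_AI_INSTALLATION_PATH
# points to <TARGET-PATH>/ryzen_ai-1.6.1/venv as shown above.
import os
import shutil

venv = os.environ["RYZEN_AI_INSTALLATION_PATH"]

# Libraries needed to run the LLM model
shutil.copytree(os.path.join(venv, "deployment"), "deployment", dirs_exist_ok=True)

# Benchmark executable and sample prompt
for name in ("model_benchmark", "amd_genai_prompt.txt"):
    shutil.copy(os.path.join(venv, "LLM", "examples", name), ".")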
The current working directory should now contain the following files:
deployment model_benchmark amd_genai_prompt.txt Phi-3.5-mini-instruct-onnx-ryzenai-npu
Two files in the Phi-3.5 model folder have to be updated for the Linux environment:
1) Edit the genai_config.json file in the model folder (a script that applies these edits is sketched after this list):
- "custom_ops_library": "deployment/lib/libonnx_custom_ops.so" (line 8)
- "config_entries": {
"hybrid_dbg_use_aie_rope": "0", (line 11 - Add flag under config_entries)
2) Edit the .cache/MatMulNBits_2_0_meta.json file in the model folder:
# Utility script to convert Windows-style paths in "MatMulNBits_2_0_meta.json" to Linux-style paths
meta_path = 'Phi-3.5-mini-instruct-onnx-ryzenai-npu/.cache/MatMulNBits_2_0_meta.json'

with open(meta_path, 'r') as f:
    lines = f.readlines()

# Replace backslashes with forward slashes on lines that reference the .cache folder
for i in range(len(lines)):
    if '.cache' in lines[i]:
        lines[i] = lines[i].replace('\\', '/')

with open(meta_path, 'w') as f:
    f.writelines(lines)
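The genai_config.json edits from step 1 can also be applied with a short script. This is a minimal sketch, not official tooling: it does not assume where the keys sit in the file, it simply walks the JSON and patches "custom_ops_library" and "config_entries" wherever they appear, then rewrites the file. Review the result against the line references above.

# Sketch: apply the genai_config.json edits from step 1 programmatically.
# The exact nesting of these keys is not assumed; the script walks the JSON
# and patches wherever the keys appear. Review the file afterwards.
import json

cfg_path = 'Phi-3.5-mini-instruct-onnx-ryzenai-npu/genai_config.json'

with open(cfg_path, 'r') as f:
    cfg = json.load(f)

def patch(node):
    if isinstance(node, dict):
        if 'custom_ops_library' in node:
            node['custom_ops_library'] = 'deployment/lib/libonnx_custom_ops.so'
        if 'config_entries' in node and isinstance(node['config_entries'], dict):
            node['config_entries']['hybrid_dbg_use_aie_rope'] = '0'
        for value in node.values():
            patch(value)
    elif isinstance(node, list):
        for value in node:
            patch(value)

patch(cfg)

with open(cfg_path, 'w') as f:
    json.dump(cfg, f, indent=4)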
Lastly, add the deployment library directory to LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=deployment/lib:$LD_LIBRARY_PATH
We can now run the model with the command below:
./model_benchmark -i Phi-3.5-mini-instruct-onnx-ryzenai-npu/ -l 128 -f amd_genai_prompt.txt
# Enable "-v" flag for verbose output
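The benchmark can also be launched from Python. A minimal sketch using subprocess, with the same model folder, generation length, and prompt file as the command above:

# Sketch: run model_benchmark from Python with LD_LIBRARY_PATH set.
# Uses the same model folder, generation length, and prompt file as above.
import os
import subprocess

env = os.environ.copy()
env['LD_LIBRARY_PATH'] = os.path.abspath('deployment/lib') + ':' + env.get('LD_LIBRARY_PATH', '')

subprocess.run(
    ['./model_benchmark',
     '-i', 'Phi-3.5-mini-instruct-onnx-ryzenai-npu/',
     '-l', '128',
     '-f', 'amd_genai_prompt.txt'],
    env=env,
    check=True,
)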
Expected output#
-----------------------------
Prompt Number of Tokens: 128
Batch size: 1, prompt tokens: 128, tokens to generate: 128
Prompt processing (time to first token):
avg (us): 442251
avg (tokens/s): 289.428
p50 (us): 442583
stddev (us): 4901.59
n: 5 * 128 token(s)
Token generation:
avg (us): 85353.7
avg (tokens/s): 11.716
p50 (us): 84689.3
stddev (us): 7012.99
n: 635 * 1 token(s)
Token sampling:
avg (us): 27.4852
avg (tokens/s): 36383.2
p50 (us): 27.652
stddev (us): 0.928063
n: 5 * 1 token(s)
E2E generation (entire generation loop):
avg (ms): 11282.4
p50 (ms): 11275.4
stddev (ms): 14.2974
n: 5
Peak working set size (bytes): 6736375808
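The reported tokens/s values follow directly from the average latencies; a quick check using the numbers from this run:

# Sanity check: tokens/s values are derived from the average latencies above.
prompt_tokens = 128

prompt_avg_us = 442251        # time to first token (processes all 128 prompt tokens)
print(prompt_tokens / (prompt_avg_us / 1e6))   # ~289.4 tokens/s

gen_avg_us = 85353.7          # average time per generated token
print(1 / (gen_avg_us / 1e6))                  # ~11.7 tokens/s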
Preparing OGA Model#
Preparing an OGA model is a two-step process.
Model Quantization#
Follow the model quantization steps described in Preparing OGA Models.
Postprocessing#
The model quantization step produces a quantized PyTorch model.
The model_generate script first converts the quantized PyTorch model to ONNX format and then postprocesses it to run in NPU execution mode.
pip install onnx-ir
model_generate --npu <output_dir> <quantized_model_path> --optimize decode
Expected Output
NPU optimize decode model generated successfully.
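As a quick sanity check, you can confirm that the output directory contains the generated ONNX model and its genai_config.json, mirroring the prequantized Phi-3.5 model folder used earlier. The directory layout here is an assumption, so adjust the names to what model_generate actually produced:

# Hypothetical sanity check on the model_generate output directory.
# Assumes <output_dir> contains a genai_config.json and at least one .onnx file,
# mirroring the prequantized Phi-3.5 model folder used earlier on this page.
import pathlib
import sys

out_dir = pathlib.Path(sys.argv[1] if len(sys.argv) > 1 else 'output_dir')

onnx_files = list(out_dir.glob('**/*.onnx'))
has_config = (out_dir / 'genai_config.json').exists()

print(f'ONNX files found: {[p.name for p in onnx_files]}')
print(f'genai_config.json present: {has_config}')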