Running LLM on Linux


This page walks through an example of running an LLM on the RyzenAI NPU.

  • Open a Linux terminal and create a new folder

mkdir run_llm
cd run_llm
  • Choose any prequantized, postprocessed, ready-to-run model from the Hugging Face collections of NPU models

Models with 4K Context length

Models with 16K Context length
  • This walkthrough uses “Phi-3.5-mini-instruct_rai_1.7.1_npu_4K” as the reference model

# Make sure git-lfs is installed (https://git-lfs.com)
sudo apt install git-lfs
git lfs install
git clone https://huggingface.co/amd/Phi-3.5-mini-instruct_rai_1.7.1_npu_4K
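
If git-lfs was not active when the repository was cloned, large model files arrive as small text pointer files instead of the actual weights. The sketch below lists any such pointers by looking for the Git LFS pointer header on the first line of small files. `find_lfs_pointers` is a hypothetical helper name, not part of git-lfs or the Ryzen AI tooling.

```shell
# Sketch: list files under a directory that are still Git LFS pointer stubs.
# find_lfs_pointers is a hypothetical helper, not part of git-lfs.
find_lfs_pointers() {
    # Pointer files are tiny text files, so only inspect files of at most 1 KiB
    # and skip the repository's own .git directory.
    find "$1" -type f -size -2k ! -path '*/.git/*' | while IFS= read -r f; do
        if head -n 1 "$f" | grep -q '^version https://git-lfs.github.com/spec/v1'; then
            echo "$f"
        fi
    done
}
```

Running `find_lfs_pointers Phi-3.5-mini-instruct_rai_1.7.1_npu_4K` should print nothing for a complete clone. If it lists files, running `git lfs pull` inside the cloned repository fetches the real content.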
  • Look up the value of RYZEN_AI_INSTALLATION_PATH

# Activate the virtual environment created in the Linux installation step
source <TARGET-PATH>/venv/bin/activate

echo $RYZEN_AI_INSTALLATION_PATH
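
An empty line from `echo` usually means the virtual environment is not active in the current shell. A minimal sketch of a guard for that case follows, assuming the variable is exported once the environment is activated (as the step above implies) and that it points at a directory. `check_install_path` is a hypothetical helper name.

```shell
# Sketch: fail early if RYZEN_AI_INSTALLATION_PATH is unset or not a directory.
# check_install_path is a hypothetical helper, not part of the Ryzen AI tooling.
check_install_path() {
    if [ -z "${RYZEN_AI_INSTALLATION_PATH:-}" ]; then
        echo "RYZEN_AI_INSTALLATION_PATH is not set; activate the virtual environment first" >&2
        return 1
    fi
    if [ ! -d "$RYZEN_AI_INSTALLATION_PATH" ]; then
        echo "RYZEN_AI_INSTALLATION_PATH is not a directory: $RYZEN_AI_INSTALLATION_PATH" >&2
        return 1
    fi
    # Print the path on success so the caller can capture it.
    echo "$RYZEN_AI_INSTALLATION_PATH"
}
```

For example, `check_install_path || echo "fix the environment before continuing"` can be run before the copy steps that follow.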
  • Copy the necessary files into the current working directory

- Deployment folder - Contains the libraries needed to run the LLM
    # Copy the "deployment" folder from <TARGET-PATH>/venv
    cp -r <TARGET-PATH>/venv/deployment .

- Model Benchmark Script
    # Copy the "model_benchmark" file from <TARGET-PATH>/venv/LLM/examples/
    cp <TARGET-PATH>/venv/LLM/examples/model_benchmark .

- Prompt file - Input to the LLM
    # Copy the "amd_genai_prompt.txt" file from <TARGET-PATH>/venv/LLM/examples/
    cp <TARGET-PATH>/venv/LLM/examples/amd_genai_prompt.txt .
  • The current working directory should now contain the following files

amd_genai_prompt.txt   deployment   model_benchmark   Phi-3.5-mini-instruct_rai_1.7.1_npu_4K
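
The listing above can also be checked with a short script rather than by eye. The sketch below reports whichever of the four expected entries is missing. `check_workdir` is a hypothetical helper name, and the model folder name assumes the reference model used in this walkthrough.

```shell
# Sketch: verify that the working directory holds the four expected entries.
# check_workdir is a hypothetical helper; pass a directory or default to ".".
check_workdir() {
    dir="${1:-.}"
    missing=0
    for item in amd_genai_prompt.txt deployment model_benchmark \
                Phi-3.5-mini-instruct_rai_1.7.1_npu_4K; do
        if [ ! -e "$dir/$item" ]; then
            echo "missing: $item" >&2
            missing=1
        fi
    done
    # Return 0 only when every entry was found.
    return "$missing"
}
```

Calling `check_workdir` from inside `run_llm` prints nothing and returns success when all four entries are present.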
  • Create a new file named “xrt.ini” for the XRT driver settings

- vi xrt.ini            (opens a new file for editing)

- Add the lines below to the file and save it
    [Debug]
    num_heap_pages = 8

- Set XRT_INI_PATH to point to this file
    export XRT_INI_PATH=$PWD/xrt.ini
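
For scripted setups, the same file can be written without opening an editor. The sketch below uses a here-document with the two lines shown above and then exports `XRT_INI_PATH`. `write_xrt_ini` is a hypothetical helper name, and its optional directory argument is an addition for illustration.

```shell
# Sketch: create xrt.ini non-interactively and point XRT_INI_PATH at it.
# write_xrt_ini is a hypothetical helper; it writes into the given directory,
# or into the current working directory when no argument is passed.
write_xrt_ini() {
    dir="${1:-$PWD}"
    cat > "$dir/xrt.ini" <<'EOF'
[Debug]
num_heap_pages = 8
EOF
    export XRT_INI_PATH="$dir/xrt.ini"
}
```

Running `write_xrt_ini` from inside `run_llm` has the same effect as the manual steps: the file is created there and `XRT_INI_PATH` holds its absolute path.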
  • Lastly, set the required library paths

export LD_LIBRARY_PATH=/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=deployment/lib:$LD_LIBRARY_PATH
export RYZENAI_EP_PATH=$PWD/deployment/lib/libonnxruntime_providers_ryzenai.so
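
Note that `deployment/lib` is a relative entry, and the dynamic loader resolves relative `LD_LIBRARY_PATH` entries against the directory the program is started from, so the benchmark needs to be launched from `run_llm`. The sketch below lists entries that do not resolve to a directory from the current location. `missing_lib_dirs` is a hypothetical helper name.

```shell
# Sketch: print LD_LIBRARY_PATH entries that are not directories when resolved
# from the current working directory. missing_lib_dirs is a hypothetical helper;
# it checks its first argument, or $LD_LIBRARY_PATH when none is given.
missing_lib_dirs() {
    paths="${1:-${LD_LIBRARY_PATH:-}}"
    old_ifs=$IFS
    IFS=:                       # split the search path on colons
    for entry in $paths; do
        if [ -n "$entry" ] && [ ! -d "$entry" ]; then
            echo "$entry"
        fi
    done
    IFS=$old_ifs                # restore normal word splitting
}
```

From inside `run_llm`, `missing_lib_dirs` should not list `deployment/lib`. A similar one-line check for the execution provider is `[ -f "$RYZENAI_EP_PATH" ] || echo "provider library not found"`.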
  • The model can now be run with the command below:

./model_benchmark -i Phi-3.5-mini-instruct_rai_1.7.1_npu_4K/ -l 128 -f amd_genai_prompt.txt

# Add the "-v" flag for verbose output

Expected output

-----------------------------
Prompt Number of Tokens: 128

Batch size: 1, prompt tokens: 128, tokens to generate: 128
Prompt processing (time to first token):
       avg (us):       148056
       avg (tokens/s): 864.536
       p50 (us):       148143
       stddev (us):    375.335
       n:              5 * 128 token(s)
Token generation:
       avg (us):       56874.3
       avg (tokens/s): 17.5826
       p50 (us):       56250.6
       stddev (us):    6743.11
       n:              635 * 1 token(s)
Token sampling:
       avg (us):       27.273
       avg (tokens/s): 36666.3
       p50 (us):       27.21
       stddev (us):    0.202461
       n:              5 * 1 token(s)
E2E generation (entire generation loop):
       avg (ms):       7371.29
       p50 (ms):       7378.4
       stddev (ms):    14.3836
       n:              5
Peak working set size (bytes): 12168941568
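
When comparing runs, it can help to pull a single figure out of this report. The sketch below extracts the `avg (tokens/s)` value for one named section, assuming the layout shown above: unindented section headers ending in a colon, followed by indented statistics. `section_rate` is a hypothetical helper name, not an option of `model_benchmark`.

```shell
# Sketch: print the "avg (tokens/s)" value for one section of the report.
# section_rate is a hypothetical helper; it reads the report on stdin.
# Usage: section_rate "Token generation" < benchmark.log
section_rate() {
    awk -v section="$1" '
        /^[^[:space:]].*:$/ { in_section = (index($0, section) == 1) }  # section headers are unindented and end with ":"
        in_section && /avg \(tokens\/s\):/ { print $NF; exit }
    '
}
```

With the report saved to a file, for instance by appending `| tee benchmark.log` to the run command, `section_rate "Token generation" < benchmark.log` prints `17.5826` for the sample above. That figure is consistent with the reported per-token latency: 1,000,000 / 56874.3 µs ≈ 17.58 tokens/s.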

Preparing OGA Model

Install the “model_generate” package in the current virtual environment

pip install model-generate==1.7.1 --force-reinstall --no-deps --extra-index-url https://pypi.amd.com/ryzenai_llm/1.7.1/linux/simple/

Currently, Linux supports only the NPU flow. For more on model generation, see Preparing OGA Models.