High-Level Python SDK

A Python environment is a great way to get started because it provides maximum flexibility for trying out models, profiling them, and integrating them into Python applications.

We use the lemonade SDK from the ONNX TurnkeyML project to get up and running quickly.

To get started, follow these instructions.

System-level prerequisites

You only need to do this once per computer:

  1. Make sure your system has the Ryzen AI 1.3 driver installed:

  • Download the NPU driver installation package (NPU_RAI1.3.zip).

  • Install the NPU drivers by following these steps:

    • Extract the downloaded NPU_RAI1.3.zip file.

    • Open a terminal in administrator mode and run .\npu_sw_installer.exe.

  • Ensure that the NPU MCDM driver (version 32.0.203.237 or 32.0.203.240) is installed correctly by opening Device Manager -> Neural processors -> NPU Compute Accelerator Device.

  2. Download and install Miniconda for Windows.

  3. Launch a terminal and run conda init.

Environment Setup

To create and set up an environment, run these commands in your terminal:

conda create -n ryzenai-llm python=3.10
conda activate ryzenai-llm
pip install turnkeyml[llm-oga-hybrid]
lemonade-install --ryzenai hybrid
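
To verify the installation, you can print the lemonade help text, which lists the available tools:

lemonade -h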

Validation Tools

Now that installation is complete, you can try prompting an LLM. Run the following command in a terminal that has your environment activated, replacing PROMPT with any prompt you like:

lemonade -i amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid oga-load --device hybrid --dtype int4 llm-prompt --max-new-tokens 64 -p PROMPT
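
For example, with a quoted string substituted for PROMPT:

lemonade -i amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid oga-load --device hybrid --dtype int4 llm-prompt --max-new-tokens 64 -p "Hello, my name is"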

Each example linked in the Supported LLMs table also includes example commands for validating that model's speed and accuracy.
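
As one illustration, lemonade provides benchmarking tools that chain after oga-load in the same way as llm-prompt. The oga-bench tool name below reflects the version of lemonade current at the time of writing; run lemonade -h to confirm the tools available in your installation:

lemonade -i amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid oga-load --device hybrid --dtype int4 oga-bench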

Python API

You can also run this code to try out the high-level lemonade API in a Python script:

from lemonade.api import from_pretrained

# Download (if needed) and load the model for the NPU+iGPU hybrid device
model, tokenizer = from_pretrained(
    "amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid", recipe="oga-hybrid"
)

# Tokenize the prompt, generate up to 30 new tokens, and decode the result
input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids
response = model.generate(input_ids, max_new_tokens=30)

print(tokenizer.decode(response[0]))

Each example linked in the Supported LLMs table also has an example script for streaming the text output of the LLM.
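
As a preview, the sketch below shows one common streaming pattern. It assumes the model returned by from_pretrained accepts a Hugging Face-style streamer argument to generate(), which may not hold for every recipe, so refer to the per-model example scripts for the exact supported pattern:

from threading import Thread

from transformers import TextIteratorStreamer

from lemonade.api import from_pretrained

model, tokenizer = from_pretrained(
    "amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid", recipe="oga-hybrid"
)

input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids

# ASSUMPTION: generate() accepts a transformers-style streamer, as the
# Hugging Face API does; check the model's example script to confirm.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
thread = Thread(
    target=model.generate,
    kwargs={"input_ids": input_ids, "max_new_tokens": 30, "streamer": streamer},
)
thread.start()

# Print each chunk of text as soon as it is produced
for text in streamer:
    print(text, end="", flush=True)
thread.join()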

Next Steps

From here, you can check out an example or any of the other Supported LLMs.

The examples pages also provide code for:

  1. Additional validation tools for measuring speed and accuracy.

  2. Streaming responses with the API.

  3. Integrating the API into applications.

  4. Launching the server interface from the Python environment.
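
For the last item, the command below shows the general shape of launching the server by chaining a serve tool after oga-load. The serve tool name is an assumption based on the TurnkeyML documentation and may differ between releases, so run lemonade -h to check the tool names in your installed version:

lemonade -i amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid oga-load --device hybrid --dtype int4 serve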