LLM Flow
The Ryzen AI Software supports deploying quantized LLMs on the NPU using an eager execution mode, which simplifies model ingestion. Instead of being compiled and executed as a complete graph, the model is processed operator by operator: compute-intensive operations such as GEMM/MATMUL are dynamically offloaded to the NPU, while the remaining operators execute on the CPU. Eager mode for LLMs is supported in both PyTorch and ONNX Runtime.
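To make the mechanism concrete, the sketch below shows what operator-level offload looks like in eager PyTorch: every nn.Linear (a GEMM) is swapped for a wrapper that routes its matmul to an accelerator call, while all other operators stay on the CPU. The names NPULinear, npu_gemm, and offload_linears are hypothetical stand-ins for illustration, not part of the Ryzen AI API.

```python
import torch
import torch.nn as nn

def npu_gemm(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # Placeholder for the offloaded GEMM: a real flow would dispatch this
    # call to the NPU. Here it simply runs on the CPU.
    return x @ weight.t()

class NPULinear(nn.Module):
    """Drop-in replacement for nn.Linear that offloads its matmul."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.weight = linear.weight
        self.bias = linear.bias

    def forward(self, x):
        out = npu_gemm(x, self.weight)  # compute-intensive op, offloaded
        return out + self.bias if self.bias is not None else out

def offload_linears(model: nn.Module) -> nn.Module:
    # Walk the module tree and swap every nn.Linear for the NPU wrapper,
    # leaving all other operators (layernorm, softmax, ...) on the CPU.
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, NPULinear(child))
        else:
            offload_linears(child)
    return model
```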
A general-purpose flow is available in the amd/RyzenAI-SW repository: https://github.com/amd/RyzenAI-SW. An illustrative quantization sketch follows the summary below.
Applicability: prototyping and early development with a broad set of LLMs
Performance: functional support only, not to be used for benchmarking
Supported platforms: PHX, HPT, STX (and onwards)
Supported frameworks: PyTorch
Supported models: Many
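The general-purpose flow starts from a quantized model. As a rough illustration of that ingestion step, and not the actual RyzenAI-SW recipe, the following uses stock PyTorch dynamic quantization to quantize only the GEMM-bearing nn.Linear modules, leaving the remaining operators in fp32:

```python
import torch
from torch.ao.quantization import quantize_dynamic

# Toy model standing in for an LLM decoder block's GEMM-heavy layers.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).eval()

# Quantize only the Linear (GEMM) modules; other ops keep running in fp32.
qmodel = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
print(qmodel)
```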
A set of performance-optimized models is available upon request from the AMD secure download site: https://account.amd.com/en/member/ryzenai-sw-ea.html. A sketch of the corresponding ONNX Runtime session setup follows the summary below.
Applicability: benchmarking and deployment of specific LLMs
Performance: highly optimized
Supported platforms: STX (and onwards)
Supported frameworks: PyTorch and ONNX Runtime
Supported models: Llama2, Llama3, Qwen1.5
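For the ONNX Runtime path, NPU execution is reached through an execution provider. The sketch below creates a session targeting the Vitis AI execution provider with a CPU fallback for unsupported operators; the model filename is a placeholder, and the exact provider options required may vary by release, so consult the Ryzen AI documentation for your version.

```python
import onnxruntime as ort

# Placeholder model path; use your quantized LLM export here.
session = ort.InferenceSession(
    "llama2_quantized.onnx",
    providers=["VitisAIExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())
```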