Model Deployment#

ONNX Runtime with Vitis AI Execution Provider#

Quantized models are deployed by creating an ONNX Runtime inference session and leveraging the Vitis AI Execution Provider (VAI EP). Both the ONNX Runtime C++ and Python APIs are supported.

providers = ['VitisAIExecutionProvider']
session = ort.InferenceSession(model,
                               sess_options=sess_opt,
                               providers=providers,
                               provider_options=provider_options)

Vitis AI Execution Provider Options#

The Vitis AI Execution Provider supports the following options:

| Provider Option | Type | Default | Description |
| --- | --- | --- | --- |
| config_file | Mandatory | None | The path and name of the runtime configuration file. A default version of this file can be found in the voe-4.0-win_amd64 folder of the Ryzen AI software installation package under the name vaip_config.json. |
| cacheDir | Optional | C:\temp\{user}\vaip\.cache | The cache directory. |
| cacheKey | Optional | {onnx_model_md5} | Compiled model directory generated inside the cache directory. Use a string to specify the desired name of the compiled model directory. For example: 'cacheKey': 'resnet50_cache'. |
| encryptionKey | Optional | None | Encryption/decryption key for the generated compiled models. |
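
A minimal Python sketch showing how these options can be passed together through provider_options (the model file, cache path, and cache key below are placeholders):

import onnxruntime as ort

# All paths and names below are placeholders - substitute values from your installation.
provider_options = [{
    "config_file": "/path/to/vaip_config.json",  # mandatory
    "cacheDir": "/path/to/cacheDir",             # optional: overrides the default cache directory
    "cacheKey": "resnet50_cache"                 # optional: names the compiled model directory
}]

session = ort.InferenceSession(
    "model.onnx",
    providers=["VitisAIExecutionProvider"],
    provider_options=provider_options)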

Environment Variables#

Additionally, the following environment variables can be used to control the Ryzen AI ONNX Runtime-based deployment.

| Environment Variable | Type | Default | Description |
| --- | --- | --- | --- |
| XLNX_VART_FIRMWARE | Mandatory | None | Set it to one of the NPU configuration binaries. For more details, refer to the Runtime Setup page. |
| XLNX_ENABLE_CACHE | Optional | 1 | If unset, the runtime flow ignores the cache directory and recompiles the model. |
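
A minimal sketch of setting these variables from Python before the inference session is created (the .xclbin path below is a placeholder; use the NPU configuration binary from your installation):

import os

# Placeholder path - point this at the NPU configuration binary from your installation.
os.environ["XLNX_VART_FIRMWARE"] = r"C:\path\to\npu_configuration.xclbin"

# XLNX_ENABLE_CACHE defaults to 1; if it is unset, the runtime ignores the cache
# directory and recompiles the model on every run.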


Python API Example#

import onnxruntime

# Add user imports
# ...

# Load inputs and perform preprocessing
# ...

# Create an inference session using the Vitis AI execution provider
session = onnxruntime.InferenceSession(
    '[model_file].onnx',
    providers=["VitisAIExecutionProvider"],
    provider_options=[{"config_file": "/path/to/vaip_config.json"}])

input_shape = session.get_inputs()[0].shape
input_name = session.get_inputs()[0].name

# Load the inputs and preprocess them according to input_shape
input_data = [...]

# Passing an empty list of output names returns all model outputs
result = session.run([], {input_name: input_data})

C++ API Example#

#include <onnxruntime_cxx_api.h>
#include <codecvt>
#include <locale>
// include user header files
// ...
std::string xclbin_path = "path/to/xclbin";
std::string model_path  = "path/to/model.onnx";
std::string config_path = "path/to/config.json";

// Convert the model path to a wide string (ORTCHAR_T on Windows)
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> strconverter;
auto model_name = strconverter.from_bytes(model_path);

_putenv_s("XLNX_VART_FIRMWARE", xclbin_path.c_str());

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "quicktest");

// create inference session
auto session_options = Ort::SessionOptions();
auto options = std::unordered_map<std::string, std::string>{
    {"config_file", config_path},          // Required
    {"cacheDir",    "path/to/cacheDir"},   // Optional
    {"cacheKey",    "cacheName"}           // Optional
};
session_options.AppendExecutionProvider_VitisAI(options);
auto session = Ort::Session(env, model_name.data(), session_options);

// preprocess input data
// ...


// get input/output names from model
size_t                   input_count;
size_t                   output_count;
std::vector<const char*> input_names;
std::vector<const char*> output_names;
...

// initialize input tensors
std::vector<Ort::Value>  input_tensors;
...

// run inference
auto output_tensors = session.Run(
        Ort::RunOptions(),
        input_names.data(), input_tensors.data(), input_count,
        output_names.data(), output_count);

// postprocess output data
// ...

Simultaneous Sessions#

Up to eight simultaneous inference sessions can be run on the NPU. The runtime automatically schedules each inference session on available slots to maximize performance of the application.

The performance of individual inference sessions is impacted by multiple factors, including the APU type, the NPU configuration used, the number of other inference sessions running on the NPU, and the applications running the inference sessions.
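
As an illustrative sketch (not an official API pattern), separate sessions can simply be created and run from independent Python threads; the model files, config path, and dummy input below are placeholders and assumptions:

import threading
import numpy as np
import onnxruntime as ort

CONFIG = "/path/to/vaip_config.json"   # placeholder path

def run_model(model_file, cache_name):
    # Each session can use a distinct cacheKey so its compiled model is cached separately.
    session = ort.InferenceSession(
        model_file,
        providers=["VitisAIExecutionProvider"],
        provider_options=[{"config_file": CONFIG, "cacheKey": cache_name}])
    inp = session.get_inputs()[0]
    # Dummy float32 input purely for illustration; real applications feed preprocessed data.
    dummy = np.zeros([d if isinstance(d, int) else 1 for d in inp.shape], dtype=np.float32)
    return session.run(None, {inp.name: dummy})

# Launch two inference sessions concurrently (up to eight can run on the NPU).
threads = [threading.Thread(target=run_model, args=("model_a.onnx", "model_a_cache")),
           threading.Thread(target=run_model, args=("model_b.onnx", "model_b_cache"))]
for t in threads:
    t.start()
for t in threads:
    t.join()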


Model Encryption#

To protect developers’ intellectual property, encryption is supported as a session option. When it is enabled, all the compiled models generated are encrypted using AES-256. To enable encryption, pass the encryption key through the VAI EP options as follows:

In Python:

session = onnxruntime.InferenceSession(
    '[model_file].onnx',
    providers=["VitisAIExecutionProvider"],
    provider_options=[{
        "config_file":"/path/to/vaip_config.json",
        "encryptionKey": "89703f950ed9f738d956f6769d7e45a385d3c988ca753838b5afbc569ebf35b2"
}])

In C++:

std::string model_name = "resnet50_pt.onnx";
Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "resnet50_pt");
auto session_options = Ort::SessionOptions();
auto options = std::unordered_map<std::string, std::string>{};
options["config_file"] = "/path/to/vaip_config.json";
options["encryptionKey"] = "89703f950ed9f738d956f6769d7e45a385d3c988ca753838b5afbc569ebf35b2";

session_options.AppendExecutionProvider("VitisAI", options);
auto session = Ort::Experimental::Session(env, model_name, session_options);

The key is a 256-bit value represented as a 64-character hexadecimal string. An encrypted model generated in the cache directory cannot currently be opened with Netron. Additionally, as a side effect, dumping is disabled to prevent leaking sensitive information about the model.


Operator Assignment Report#

Vitis AI EP generates a file named vitisai_ep_report.json that reports how the model operators are assigned across the CPU and the NPU. This file is automatically generated in the cache directory, which by default is C:\temp\{user}\vaip\.cache\<model_cache_key> if no explicit cache location is specified in the code. The report includes information such as the total number of nodes, the list of operator types in the model, and which nodes and operators run on the NPU or on the CPU (note: nodes and operators running on the NPU are reported under the DPU name). Additionally, the report includes node statistics, such as the inputs to a node, the applied operation, and the outputs from the node.

{
  "deviceStat": [
  {
    "name": "all",
    "nodeNum": 402,
    "supportedOpType": [
    "::Add",
    ...
    ]
  },
  {
    "name": "CPU",
    "nodeNum": 2,
    "supportedOpType": [
    "::DequantizeLinear",
    ...
    ]
  },
  {
    "name": "DPU",
    "nodeNum": 400,
    "supportedOpType": [
    "::Add",
    ...
    ]
  }
  ],
  ...
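
As a minimal sketch (assuming the layout shown in the excerpt above), the report can be loaded to summarize how many nodes were assigned to each device:

import json

# Placeholder path: the report is generated inside the cache directory,
# i.e. <cacheDir>/<model_cache_key>/vitisai_ep_report.json
report_path = "path/to/cacheDir/model_cache_key/vitisai_ep_report.json"

with open(report_path) as f:
    report = json.load(f)

# "DPU" entries correspond to nodes running on the NPU.
for device in report["deviceStat"]:
    print(f'{device["name"]}: {device["nodeNum"]} nodes, '
          f'{len(device["supportedOpType"])} operator types')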