AI NPU Module

Supported ICs: [RTL8735C]

Overview

The platform integrates a dedicated Neural Processing Unit (NPU) that offloads Deep Neural Network (DNN) computation from the main CPU, enabling real-time AI inference - such as object detection and face recognition - with low power consumption.

At INT8 precision, the NPU delivers approximately 1 TOPS at its 600 MHz operating frequency. It is backed by 256 KB of on-chip SRAM (VIP_SRAM) for intermediate tensors and a dedicated 12 MB DDR region for model weights and I/O buffers (see NN Memory Layout).

For detailed hardware architecture, supported layer types, quantisation formats, and software stack, see NPU Hardware Reference.

Integration Overview

All NN examples follow the same integration pattern. The application opens a V5 video channel as an RGB (or NV12) source, passes frames to the VIPNN module for NPU inference, and receives structured results through a result callback function.

[Video V5: RGB NN_WIDTH x NN_HEIGHT @ fps] --> SISO --> [VIPNN module]
                                                              |
                                                       nn_display_cb()

Steps to get NN running:

Select a pre-built model from Pre-Built Model Library (or request a custom model - see Custom Model Conversion).
Get the .nb model binary onto the device file system - see Deploying Models to Device.
Configure the VIPNN module in your application - see VIPNN Module.
Implement the result callback to act on inference output.

Pre-Built Model Library

Pre-compiled .nb model binaries are included in the SDK under the following directory:

component/soc/<soc>/video/nn/app/nn_model/binary/

To use a model, place its .nb file on the device file system before running the application. See Deploying Models to Device for how to flash or copy model files to the device.

Object Detection

YOLO Series

YOLO (You Only Look Once) is a widely used real-time object detection algorithm. The following variants are provided:

Model object	Binary filename	Input size	Quantisation
`yolov4_tiny`	`yolov4_tiny_asymu8.nb`	416 x 416	uint8
`yolov7_tiny`	`yolov7_tiny_576x320_asymu8.nb`	576 x 320	uint8
`yolov7_tiny`	`yolov7_tiny_640x480_asymu8.nb`	640 x 480	uint8
`yolov9_tiny`	`yolov9_tiny_dfpi16.nb`	416 x 416	int16 DFP

Include the corresponding header and select the model object at compile time:

#include "model_yolo.h"    // yolov4_tiny, yolov7_tiny
#include "model_yolov9.h"  // yolov9_tiny

#define NN_MODEL_OBJ    yolov7_tiny
#define NN_MODEL_NAME   "vfs:/yolov7_tiny_576x320_asymu8.nb"
#define NN_WIDTH        576
#define NN_HEIGHT       320

Each detected object is reported as an objdetect_res_t:

typedef struct objdetect_res_s {
    union {
        float result[6];   // [class_id, score, top_x, top_y, bot_x, bot_y]
        detobj_t res;
    };
} objdetect_res_t;

All coordinates are normalised to [0.0, 1.0] relative to the logical detector input size. In the common case this is the network tensor size. If the .nb model includes an in-graph preprocessing or scaling layer, the application can provide model_width / model_height in nn_data_param_t so YOLO post-processing decodes boxes against the logical detector size. The YOLO models are trained on the COCO dataset (80 classes).
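Because the coordinates are normalised, an application usually has to scale them back to pixels before drawing. The following is a minimal sketch of that step, not SDK code: pixel_box_t, norm_to_pixel and objdetect_to_pixels are hypothetical names, and the choice to clamp to [0, 1] and map onto [0, extent - 1] is an assumption made here for illustration.

```c
/* Pixel-space rectangle. Hypothetical helper type, not an SDK type. */
typedef struct {
    int x0, y0, x1, y1;
} pixel_box_t;

/* Clamp a normalised coordinate to [0, 1] and scale it to [0, extent - 1],
 * rounding to the nearest pixel. */
static int norm_to_pixel(float v, int extent)
{
    if (v < 0.0f) {
        v = 0.0f;
    }
    if (v > 1.0f) {
        v = 1.0f;
    }
    return (int)(v * (float)(extent - 1) + 0.5f);
}

/* result[] follows the layout documented for objdetect_res_t:
 * [class_id, score, top_x, top_y, bot_x, bot_y]. */
static pixel_box_t objdetect_to_pixels(const float result[6], int width, int height)
{
    pixel_box_t box;
    box.x0 = norm_to_pixel(result[2], width);
    box.y0 = norm_to_pixel(result[3], height);
    box.x1 = norm_to_pixel(result[4], width);
    box.y1 = norm_to_pixel(result[5], height);
    return box;
}
```

Pass the size of the surface you draw on (for example the display or OSD resolution) as width and height; it does not have to equal the network input size.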

For more information: https://github.com/AlexeyAB/darknet

Face Detection

SCRFD

SCRFD (Sample and Computation Redistribution for Face Detection) is a lightweight, high-accuracy face detector that outputs bounding boxes and 5-point facial landmarks.

Model object	Binary filename	Input size	Quantisation
`scrfd`	`scrfd_500m_bnkps_shape576x320.nb`	576 x 320	uint8

#include "model_scrfd.h"

#define NN_MODEL_OBJ    scrfd
#define NN_MODEL_NAME   "vfs:/scrfd_500m_bnkps_shape576x320.nb"
#define NN_WIDTH        576
#define NN_HEIGHT       320

The detection result is stored as facedetect_res_t:

typedef struct facedetect_res_s {
    union {
        float result[6];    // [class_id, score, top_x, top_y, bot_x, bot_y]
        detobj_t res;
    };
    landmark_t landmark;    // 5 facial landmark points (x, y) normalised to [0.0, 1.0]
} facedetect_res_t;

For more information: https://github.com/deepinsight/insightface/tree/master/detection/scrfd

Face Recognition

MobileFaceNet

MobileFaceNet is a compact face recognition model trained with ArcFace (Additive Angular Margin Loss). It takes a cropped and aligned face image and outputs a 128-dimensional feature embedding for identity matching.

MobileFaceNet is typically used together with SCRFD in a cascaded detect-then-recognise pipeline: SCRFD first detects and localises faces in the full frame, then MobileFaceNet extracts an embedding from each cropped face region for comparison against a stored identity database. See Cascaded Mode for the VIPNN configuration.

Model object	Binary filename	Input size	Quantisation
`mbfacenet_fwfs`	`mobilefacenet_pcqsymi8.nb`	112 x 112	int8 sym

#include "model_mobilefacenet.h"

The recognition result is stored as face_feature_res_t:

#define MAX_FACE_FEATURE_DIM 128
typedef struct face_feature_res_s {
    union {
        float result[6];
        detobj_t res;
    };
    float feature[MAX_FACE_FEATURE_DIM];  // 128-dim face embedding
} face_feature_res_t;

For more information: https://github.com/deepinsight/insightface/tree/master/recognition

Model Memory and File Size Reference

The following table lists the memory footprint of each SDK model. The DDR memory column represents the NPU runtime memory (model weights + I/O tensor buffers). All models fit within the 12 MB NN_DDR window.

Category	Model binary	Input size	Quantisation	DDR memory usage	File size
Object detection	`yolov4_tiny_asymu8.nb`	416 x 416	uint8	6.51 MB	3.59 MB
Object detection	`yolov7_tiny_576x320_asymu8.nb`	576 x 320	uint8	7.25 MB	3.96 MB
Object detection	`yolov7_tiny_640x480_asymu8.nb`	640 x 480	uint8	10.13 MB	3.83 MB
Object detection	`yolov9_tiny_dfpi16.nb`	416 x 416	int16 DFP	10.22 MB	4.73 MB
Face detection	`scrfd_500m_bnkps_shape576x320.nb`	576 x 320	uint8	2.28 MB	0.75 MB
Face recognition	`mobilefacenet_pcqsymi8.nb`	112 x 112	int8 sym	2.06 MB	1.40 MB

Note

When running two models simultaneously (e.g. SCRFD + MobileFaceNet for detect-then-recognise), ensure the combined DDR memory usage does not exceed the 12 MB NN_DDR budget. The SCRFD + MobileFaceNet combination uses approximately 4.3 MB in total.
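The budget check in the note can be written down as a small helper. This is a sketch only: it assumes the DDR footprints from the table simply add up with no extra alignment or runtime overhead, and the names nn_ddr_budget_ok and mb_x100_to_bytes are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NN_DDR_BYTES (12u * 1024u * 1024u)  /* 12 MB NN_DDR window */

/* Convert a table value given in hundredths of a megabyte
 * (e.g. 228 for 2.28 MB) to bytes. */
static uint32_t mb_x100_to_bytes(uint32_t mb_x100)
{
    return (uint32_t)(((uint64_t)mb_x100 * 1024u * 1024u) / 100u);
}

/* Return true when the summed model footprints fit inside NN_DDR. */
static bool nn_ddr_budget_ok(const uint32_t *model_bytes, size_t count)
{
    uint64_t total = 0;

    for (size_t i = 0; i < count; i++) {
        total += model_bytes[i];
    }
    return total <= NN_DDR_BYTES;
}
```

With the table values, SCRFD (2.28 MB) plus MobileFaceNet (2.06 MB) passes the check, while the two largest YOLO variants together (10.13 MB + 10.22 MB) do not.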

Deploying Models to Device

The VIPNN module loads model binaries at runtime using a path prefix:

vfs:/ - reads from the internal LittleFS flash partition (VFS1)
sd:/ - reads from an SD card

For quick prototyping, copying the .nb file to the root of an SD card and setting sd:/model.nb as the model path requires no additional build steps. For production or devices without an SD card slot, use the vfs: path described below.
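An application that accepts a model path from configuration may want to validate it before handing it to the module. The sketch below classifies a path by the two prefixes listed above and checks that it fits the 64-byte model_file field shown later in vipnn_params_t. The enum and both function names are hypothetical, and how the SDK itself parses the prefix is not described here.

```c
#include <stdbool.h>
#include <string.h>

#define MODEL_FILE_FIELD_BYTES 64  /* size of model_file[] in vipnn_params_t */

typedef enum {
    MODEL_SRC_VFS,      /* "vfs:/" -> LittleFS flash partition */
    MODEL_SRC_SD,       /* "sd:/"  -> SD card */
    MODEL_SRC_UNKNOWN
} model_src_t;

/* Classify a model path by its storage prefix. */
static model_src_t model_path_source(const char *path)
{
    if (path == NULL) {
        return MODEL_SRC_UNKNOWN;
    }
    if (strncmp(path, "vfs:/", 5) == 0) {
        return MODEL_SRC_VFS;
    }
    if (strncmp(path, "sd:/", 4) == 0) {
        return MODEL_SRC_SD;
    }
    return MODEL_SRC_UNKNOWN;
}

/* True when the path has a known prefix and fits the field,
 * including the terminating NUL. */
static bool model_path_valid(const char *path)
{
    return model_path_source(path) != MODEL_SRC_UNKNOWN &&
           strlen(path) < MODEL_FILE_FIELD_BYTES;
}
```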

Flashing Models to LittleFS (VFS1)

VFS1 is the LittleFS flash partition defined in the flash layout (component/soc/usrcfg/<soc>/ameba_flashcfg.c). The current SDK stores Wi-Fi, BT, and NN data in this single LittleFS region because only one LittleFS flash region is supported at runtime:

{VFS1, 0x088A3000, 0x08EA2FFF}   /* VFS region 1: wifi/BT/NN data (6 MB) */
{VFS2, 0xFFFFFFFF, 0xFFFFFFFF}

Note

The address range shown above is for reference only. Always verify the actual VFS1 partition address and size in component/soc/usrcfg/<soc>/ameba_flashcfg.c before flashing, as the layout may differ depending on your firmware configuration.
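When verifying the partition, the size follows directly from the inclusive start and end addresses. The sketch below computes it and checks alignment against the 4096-byte block size used when building the image; both helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLASH_BLOCK_BYTES 4096u  /* block size used for the LittleFS image */

/* Size in bytes of a flash region given inclusive start and end addresses. */
static uint32_t flash_region_size(uint32_t start, uint32_t end_inclusive)
{
    return end_inclusive - start + 1u;
}

/* True when both the start address and the region size are multiples
 * of the flash block size. */
static bool flash_region_block_aligned(uint32_t start, uint32_t end_inclusive)
{
    return (start % FLASH_BLOCK_BYTES) == 0u &&
           (flash_region_size(start, end_inclusive) % FLASH_BLOCK_BYTES) == 0u;
}
```

For the reference range, 0x08EA2FFF - 0x088A3000 + 1 = 0x600000, which is the 6 MB value passed to mklittlefs with -s.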

Step 1 - Prepare the model directory

Create a local directory under tools/littlefs/linux and copy the model files used by the SDK NN video examples into it:

cd tools/littlefs/linux
mkdir -p nn_model
cp ../../../component/soc/<soc>/video/nn/app/nn_model/binary/mobilefacenet_pcqsymi8.nb nn_model/
cp ../../../component/soc/<soc>/video/nn/app/nn_model/binary/scrfd_500m_bnkps_shape576x320.nb nn_model/
cp ../../../component/soc/<soc>/video/nn/app/nn_model/binary/yolov4_tiny_asymu8.nb nn_model/

The next step packs these three common example models into one LittleFS image:

mobilefacenet_pcqsymi8.nb for MobileFaceNet face embedding
scrfd_500m_bnkps_shape576x320.nb for SCRFD face detection
yolov4_tiny_asymu8.nb for YOLO object detection

Step 2 - Build the LittleFS image

Run the following command from tools/littlefs/linux:

./mklittlefs -b 4096 -p 4096 -s 0x600000 -c nn_model/ nn_model_lfs.bin

After the image is generated, list the files inside the LittleFS image to verify that the expected model files were packed:

./mklittlefs -b 4096 -p 4096 -s 0x600000 -l nn_model_lfs.bin

Example output:

1466168    /mobilefacenet_pcqsymi8.nb             Mon Nov 17 08:57:48 2025
 787568    /scrfd_500m_bnkps_shape576x320.nb      Mon Nov 17 08:57:48 2025
3763536    /yolov4_tiny_asymu8.nb                 Mon Nov 17 08:57:48 2025

Option	Description
`-b 4096`	Block size in bytes (matches the flash erase block size)
`-p 4096`	Page size in bytes
`-s 0x600000`	Image size - must match the VFS1 partition size (6 MB)
`-c nn_model/`	Input directory to pack
`-l`	List the files in an existing image instead of creating one
`nn_model_lfs.bin`	Output LittleFS image file

Step 3 - Flash the image

Flash nn_model_lfs.bin to the VFS1 start address (0x088A3000) using the image download tool. After a successful flash, the model files are accessible at runtime via the vfs: prefix:

#define FACEDET_MODEL_NAME   "vfs:/scrfd_500m_bnkps_shape576x320.nb"
#define FACENET_MODEL_NAME   "vfs:/mobilefacenet_pcqsymi8.nb"
#define OBJDET_MODEL_NAME    "vfs:/yolov4_tiny_asymu8.nb"

Note

The total size of all packed files must not exceed the VFS1 partition size (6 MB). The three-model example above is approximately 5.74 MB, which fits in the default 6 MB VFS1 image. Refer to Model Memory and File Size Reference for the file size of each pre-compiled model.
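A quick pre-check before running mklittlefs is to round each file up to whole blocks and compare against the number of blocks in the image. This is a rough lower bound only: it assumes each file occupies whole 4096-byte blocks and ignores LittleFS metadata and per-block bookkeeping, so a set that barely passes may still fail to pack. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Number of whole blocks needed to hold file_bytes of data. */
static uint32_t blocks_needed(uint32_t file_bytes, uint32_t block_bytes)
{
    return (file_bytes + block_bytes - 1u) / block_bytes;
}

/* True when the block-rounded file sizes fit in the image.
 * File system metadata is not counted. */
static bool files_fit_image(const uint32_t *file_bytes, size_t count,
                            uint32_t image_bytes, uint32_t block_bytes)
{
    uint32_t used_blocks = 0;

    for (size_t i = 0; i < count; i++) {
        used_blocks += blocks_needed(file_bytes[i], block_bytes);
    }
    return used_blocks <= image_bytes / block_bytes;
}
```

Using the byte counts from the example listing (1466168, 787568 and 3763536), the three files need 1470 of the 1536 blocks in a 0x600000-byte image.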

VIPNN Module

The NN MMF module - vipnn - accepts RGB or NV12 frames from the video pipeline, runs inference on the NPU, and delivers structured post-processed results to the application via a callback function.

Pre-processing and post-processing are bundled with each model object (nnmodel_t), so adding a new model requires only providing a new model object - the VIPNN module itself does not need to change.

VIPNN Module Context

The internal context of the VIPNN module:

typedef struct vipnn_ctx_s {
    void *parent;
    vip_network network;                              // NPU network handle
    vip_buffer_create_params_t vip_param_in[MAX_IO_NUM];
    vip_buffer_create_params_t vip_param_out[MAX_IO_NUM];
    vip_buffer input_buffers[MAX_IO_NUM];
    vip_buffer output_buffers[MAX_IO_NUM];
    vipnn_params_t params;                            // module parameters
    vipnn_status_t status;
    char network_name[64];
    int input_count;
    int output_count;
    vipnn_preproc_t  pre_process;                    // custom pre-process hook
    vipnn_postproc_t post_process;                   // custom post-process hook
    disp_postprcess_t disp_postproc;                 // result display callback
    vipnn_cascaded_mode_t cas_mode;
    bool module_out_en;
    vipnn_measure_t measure;                         // inference FPS measurement
} vipnn_ctx_t;

Module Parameters

The vipnn_params_t structure holds the runtime parameters for the module:

typedef struct vipnn_param_s {
    char     model_file[64];    // model file path on file system (e.g. "vfs:/model.nb")
    uint8_t *model_mem;         // pointer to model in memory (alternative to file path)
    uint32_t model_size;        // model size in bytes (when using model_mem)
    int      fps;               // target inference FPS (0 = unlimited)
    int      out_res_size;      // sizeof one result structure
    int      out_res_max_cnt;   // maximum number of results per frame
    int      save_out_tensor;   // set to 1 to dump raw output tensors for offline debugging
    nn_data_param_t *in_param;  // input image parameters
    nnmodel_t       *model;     // pointer to the model object
} vipnn_params_t;

The image part of nn_data_param_t describes the frame consumed by VIPNN:

typedef struct nn_data_param_s {
    union {
        struct {
            int width, height;
            int model_width, model_height;  // optional logical detector size
            landmarki_t landmark;
        } img;
        /* audio fields omitted */
    };
    uint32_t codec_type;
    void *priv;
    int size_in_byte;
} nn_data_param_t;

Note

Set save_out_tensor = 1 to dump raw NPU output tensors to a file. This is useful when developing or verifying custom post-processing logic on a PC. Disable this flag in production builds.

When model_mem is set (non-NULL), the module loads the model from that memory pointer instead of the file system. This is useful for embedding the model binary directly into firmware rather than storing it in a separate file system partition.

Complete Module Initialisation

static nn_data_param_t nn_input_params = {
    .img = {
        .width  = NN_WIDTH,
        .height = NN_HEIGHT,
    },
    .codec_type = AV_CODEC_ID_RGB888
};

vipnn_ctx = mm_module_open(&vipnn_module);
if (vipnn_ctx) {
    mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_MODEL,           (int)&NN_MODEL_OBJ);
    mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_MODEL_FILE_NAME, (int)NN_MODEL_NAME);
    mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_IN_PARAMS,       (int)&nn_input_params);
    mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_DISPPOST,        (int)nn_display_cb);
    mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_RES_SIZE,        sizeof(objdetect_res_t));
    mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_RES_MAX_CNT,     32);
    mm_module_ctrl(vipnn_ctx, CMD_VIPNN_APPLY,               0);
}

Setting the Input Image Parameters

Use CMD_VIPNN_SET_IN_PARAMS to describe the input frame passed to the VIPNN module:

nn_data_param_t nn_input_params = {
    .img = {
        .width      = NN_WIDTH,   // incoming RGB/NV12 frame width
        .height     = NN_HEIGHT,  // incoming RGB/NV12 frame height
    },
    .codec_type = AV_CODEC_ID_RGB888   // or AV_CODEC_ID_NV12
};
mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_IN_PARAMS, (int)&nn_input_params);

For models that include an in-graph preprocessing or scaling layer, the incoming V5 RGB frame size can differ from the logical detector input size. In that case, keep width / height equal to the real input frame and set model_width / model_height to the detector size used by post-processing:

#define V5_RGB_WIDTH   1280
#define V5_RGB_HEIGHT  720
#define NN_WIDTH       416
#define NN_HEIGHT      416

nn_data_param_t nn_input_params = {
    .img = {
        .width = V5_RGB_WIDTH,
        .height = V5_RGB_HEIGHT,
        .model_width = NN_WIDTH,
        .model_height = NN_HEIGHT,
    },
    .codec_type = AV_CODEC_ID_RGB888
};

Note

The codec_type must match the output format of the upstream V5 video module. Use VIDEO_RGB + AV_CODEC_ID_RGB888 for models that require an RGB input. width and height describe the full incoming frame; ROI is no longer configured in nn_data_param_t. If the frame size differs from the network tensor size and the model does not contain its own preprocessing layer, the model preprocessing code resizes the full frame before inference.

Setting the NN Model

Each supported model is represented by an nnmodel_t object that bundles the model binary path, pre-processing, and post-processing functions together.

#include "model_yolo.h"
#define NN_MODEL_OBJ    yolov4_tiny
#define NN_MODEL_NAME   "vfs:/yolov4_tiny_asymu8.nb"

mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_MODEL,           (int)&NN_MODEL_OBJ);
mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_MODEL_FILE_NAME, (int)NN_MODEL_NAME);

Setting the Result Callback

Register a callback with CMD_VIPNN_SET_DISPPOST to receive inference results after each frame. The callback runs in the VIPNN task context - keep it short and non-blocking:

static void nn_display_cb(void *p, void *img_param)
{
    vipnn_out_buf_t *out = (vipnn_out_buf_t *)p;
    objdetect_res_t *res = (objdetect_res_t *)&out->res[0];
    int obj_num = out->res_cnt;

    for (int i = 0; i < obj_num; i++) {
        RTK_LOGI(TAG, "class=%d score=%.2f [%.2f %.2f %.2f %.2f]\r\n",
                 (int)res[i].result[0], res[i].result[1],
                 res[i].result[2], res[i].result[3],
                 res[i].result[4], res[i].result[5]);
    }
}

mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_DISPPOST, (int)nn_display_cb);

Setting Detection Thresholds

For object detection and face detection models, two post-processing thresholds control result filtering:

static float nn_confidence_thresh = 0.5;   // minimum score to keep a detection
static float nn_nms_thresh        = 0.3;   // IoU threshold for NMS suppression

mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_CONFIDENCE_THRES, (int)&nn_confidence_thresh);
mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_NMS_THRES,        (int)&nn_nms_thresh);

Increasing nn_confidence_thresh reduces false positives but also drops genuine detections whose score falls below the new threshold. Increasing nn_nms_thresh lets boxes with a higher IoU overlap survive NMS, which keeps closely spaced objects at the cost of more duplicate boxes.
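To make the NMS threshold concrete, the sketch below computes the IoU (intersection over union) of two normalised boxes and applies the usual rule that the lower-scoring box is suppressed when the IoU exceeds the threshold. This is the textbook formulation, assumed here for illustration; the SDK post-processing may differ in detail. box_t, box_iou and nms_suppresses are hypothetical names.

```c
/* Axis-aligned box with top-left (x0, y0) and bottom-right (x1, y1).
 * Hypothetical helper type, not an SDK type. */
typedef struct {
    float x0, y0, x1, y1;
} box_t;

static float box_area(box_t b)
{
    return (b.x1 - b.x0) * (b.y1 - b.y0);
}

/* Intersection over union of two boxes, in [0, 1]. */
static float box_iou(box_t a, box_t b)
{
    float ix0 = a.x0 > b.x0 ? a.x0 : b.x0;
    float iy0 = a.y0 > b.y0 ? a.y0 : b.y0;
    float ix1 = a.x1 < b.x1 ? a.x1 : b.x1;
    float iy1 = a.y1 < b.y1 ? a.y1 : b.y1;
    float iw = ix1 - ix0;
    float ih = iy1 - iy0;

    if (iw <= 0.0f || ih <= 0.0f) {
        return 0.0f;  /* boxes do not overlap */
    }
    float inter = iw * ih;
    float uni = box_area(a) + box_area(b) - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

/* Returns 1 when NMS would drop the lower-scoring of the two boxes. */
static int nms_suppresses(box_t a, box_t b, float nms_thresh)
{
    return box_iou(a, b) > nms_thresh;
}
```

Two equal-sized boxes that overlap by half have an IoU of 1/3, so one of them is suppressed at a threshold of 0.3 but both are kept at 0.5.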

Filtering by Class ID

Use CMD_VIPNN_SET_DESIRED_CLASS to restrict output to a specific object class ID. This is useful when running a multi-class model (such as YOLO trained on COCO) but the application only needs one class - for example, detecting only people (class 0 in COCO):

static int desired_class = 0;   // 0 = person in COCO
mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_DESIRED_CLASS, (int)&desired_class);

Cascaded Mode

Cascaded mode connects two VIPNN module instances in series. The result of a first-stage model (e.g. SCRFD face detection) is passed directly as input to a second-stage model (e.g. MobileFaceNet face recognition), enabling a detect-then-recognise pipeline without writing custom glue code between the two stages.

Set cas_mode on the downstream VIPNN module to enable it as a cascaded consumer. The upstream frame size and optional model_width / model_height are propagated to the cascaded input. MobileFaceNet derives its face crop ROI from the preceding SCRFD detection result inside model preprocessing, while landmarks are carried in nn_data_param_t.img.landmark for face alignment. Refer to the face recognition example in the SDK for the complete two-module setup.

The SDK face recognition example uses these additional VIPNN controls:

mm_module_ctrl(facedet_ctx, CMD_VIPNN_SET_OUTPUT, 1);
mm_module_ctrl(facedet_ctx, MM_CMD_SET_DATAGROUP, MM_GROUP_START);

mm_module_ctrl(facenet_ctx, CMD_VIPNN_SET_CASCADE, VIPNN_CMODE_ALL_ROI);
mm_module_ctrl(facenet_ctx, CMD_VIPNN_SET_OUTPUT, 1);
mm_module_ctrl(facenet_ctx, MM_CMD_SET_DATAGROUP, MM_GROUP_END);

CMD_VIPNN_SET_OUTPUT lets the first-stage result continue downstream. VIPNN_CMODE_ALL_ROI runs MobileFaceNet once for each detected face ROI instead of only the first ROI.

facerecog Module

facerecog_module consumes MobileFaceNet feature results, compares them with a registered identity database, and calls an application-provided draw callback with names and bounding boxes. It is compiled into the MMF module list as module_facerecog.c and is used by mmf2_video_example_nn_face_recognition_init.c.

The module stores up to MAX_FRC_REG_NUM (20) registered identities in RAM. CMD_FRC_SAVE_FEATURES writes them to vfs:/face_feature.bin with a CRC; CMD_FRC_LOAD_FEATURES reloads the file at runtime.

facerecog_ctx = mm_module_open(&facerecog_module);
mm_module_ctrl(facerecog_ctx, CMD_FRC_SET_THRES100, 99);
mm_module_ctrl(facerecog_ctx, CMD_FRC_SET_OSD_DRAW, (int)face_recognition_draw_object);
mm_module_ctrl(facerecog_ctx, CMD_FRC_LOAD_FEATURES, 0);

Command	Description
`CMD_FRC_SET_THRES100`	Set similarity threshold as an integer percentage. `99` means 0.99.
`CMD_FRC_SET_OSD_DRAW`	Register the draw callback that receives `frc_draw_t`.
`CMD_FRC_REGISTER_MODE`	Register the next single detected face under the supplied name.
`CMD_FRC_RECOGNITION_MODE`	Return to recognition mode.
`CMD_FRC_LOAD_FEATURES`	Load registered features from `vfs:/face_feature.bin`.
`CMD_FRC_SAVE_FEATURES`	Save registered features to `vfs:/face_feature.bin`.
`CMD_FRC_RESET_FEATURES`	Clear registered features in RAM.
`CMD_FRC_LIST_FEATURES`	Print registered identity names.
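The integer threshold can be read as follows. The sketch takes a list of similarity scores, one per registered identity, and returns the best one that meets the threshold. It is an illustration only: frc_best_match is a hypothetical name, the scores are assumed to be precomputed in the range where 1.0 is a perfect match, and whether the module uses "greater than or equal" is an assumption.

```c
#include <stddef.h>

/* Return the index of the highest score that is at least thres100 / 100,
 * or -1 when no registered identity meets the threshold. */
static int frc_best_match(const float *scores, size_t count, int thres100)
{
    float best_score = (float)thres100 / 100.0f;
    int best_index = -1;

    for (size_t i = 0; i < count; i++) {
        if (scores[i] >= best_score) {
            best_score = scores[i];
            best_index = (int)i;
        }
    }
    return best_index;
}
```

With at most MAX_FRC_REG_NUM (20) identities, a linear scan like this is cheap enough to run for every detected face.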

Module Command Reference

Command	Description
`CMD_VIPNN_SET_MODEL`	Set the model object (`nnmodel_t *`)
`CMD_VIPNN_SET_MODEL_FILE_NAME`	Set the model binary file path string (e.g. `"vfs:/model.nb"`)
`CMD_VIPNN_SET_IN_PARAMS`	Set input frame descriptor (`nn_data_param_t *`)
`CMD_VIPNN_SET_DISPPOST`	Register result callback (`disp_postprcess_t`)
`CMD_VIPNN_SET_RES_SIZE`	Set sizeof one result structure
`CMD_VIPNN_SET_RES_MAX_CNT`	Set maximum number of results per frame
`CMD_VIPNN_SET_CONFIDENCE_THRES`	Set detection confidence threshold (`float *`)
`CMD_VIPNN_SET_NMS_THRES`	Set NMS IoU threshold (`float *`)
`CMD_VIPNN_SET_DESIRED_CLASS`	Filter output to a specific object class ID (`int *`)
`CMD_VIPNN_SET_OUTPUT`	Enable module output so downstream MMF modules can consume VIPNN results
`CMD_VIPNN_SET_OUTPUT_TYPE`	Select normal or raw VIPNN output
`CMD_VIPNN_SET_CASCADE`	Enable cascaded mode (`VIPNN_CMODE_ONE_ROI` or `VIPNN_CMODE_ALL_ROI`)
`CMD_VIPN_SET_SAVE_OUT_TENSOR`	Store output tensors for debugging custom post-processing
`CMD_VIPNN_SET_USR_OUTPUT_BUF`	Use an application-provided output buffer
`CMD_VIPNN_APPLY`	Apply configuration and start the VIPNN module

For NN media example usage, see Media Example.

NN Memory Layout

The NPU uses a dedicated region in DDR for model weights, input/output tensors, and intermediate computation buffers. The NN DDR window is defined in the linker script ameba_layout.ld:

NN_DDR (rwx) : ORIGIN = 0x87400000, LENGTH = 12M

Address range: 0x87400000 - 0x87FFFFFF  (12 MB)

The complete DDR memory map for reference:

Region	Base	Size	Usage
`AP_DDR`	0x80420000	~28 MB	Application code, heap, stack
`EN_DDR`	0x82000000	32 MB	Encoder (H.264/HEVC) working memory
`VP_DDR`	0x84000000	48 MB	Video processor (ISP) working memory
`TG_DDR`	0x87000000	4 MB	Tile scaler / graphics engine
`NN_DDR`	0x87400000	12 MB	NPU model + tensor buffers
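The map above can be captured as a table for sanity checks, for example to confirm that a buffer address falls inside NN_DDR. This is a sketch rather than SDK code: ddr_region_t, ddr_region_of and ddr_map_is_contiguous are hypothetical, and the exact AP_DDR size (given only as ~28 MB) is assumed to run up to the EN_DDR base.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint32_t base;
    uint32_t size;
} ddr_region_t;

static const ddr_region_t ddr_map[] = {
    { "AP_DDR", 0x80420000u, 0x01BE0000u },  /* ~28 MB, assumed to end at EN_DDR */
    { "EN_DDR", 0x82000000u, 32u << 20 },
    { "VP_DDR", 0x84000000u, 48u << 20 },
    { "TG_DDR", 0x87000000u,  4u << 20 },
    { "NN_DDR", 0x87400000u, 12u << 20 },
};

#define DDR_MAP_COUNT (sizeof(ddr_map) / sizeof(ddr_map[0]))

/* Name of the region containing addr, or NULL when it is outside the map. */
static const char *ddr_region_of(uint32_t addr)
{
    for (size_t i = 0; i < DDR_MAP_COUNT; i++) {
        if (addr >= ddr_map[i].base && addr - ddr_map[i].base < ddr_map[i].size) {
            return ddr_map[i].name;
        }
    }
    return NULL;
}

/* Returns 1 when each region starts exactly where the previous one ends. */
static int ddr_map_is_contiguous(void)
{
    for (size_t i = 1; i < DDR_MAP_COUNT; i++) {
        if (ddr_map[i - 1].base + ddr_map[i - 1].size != ddr_map[i].base) {
            return 0;
        }
    }
    return 1;
}
```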

NPU Hardware Reference

Compute Performance

Data type	MACs / cycle	Notes
INT8	768 MACs -> ~1 TOPS	Default; best throughput
INT16 (DFP)	192 MACs	Higher numerical precision
FP16 / BF16	384 MACs	Floating-point; used for PPU-side layers
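The INT8 figure can be reproduced from the table with a one-line calculation. The sketch assumes the common convention that one multiply-accumulate counts as two operations and uses the 600 MHz clock quoted in the Overview; npu_peak_tops is a hypothetical helper name.

```c
#include <stdint.h>

/* Peak throughput in TOPS, counting one MAC as two operations
 * (one multiply plus one add). */
static double npu_peak_tops(uint32_t macs_per_cycle, double clock_hz)
{
    return (double)macs_per_cycle * 2.0 * clock_hz / 1e12;
}
```

With 768 MACs per cycle at 600 MHz this gives about 0.92 TOPS, which the table rounds to ~1 TOPS. The same formula gives roughly 0.23 TOPS for INT16 and 0.46 TOPS for FP16 / BF16. These are peak figures; sustained throughput depends on the model and on memory bandwidth.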

NPU Architecture

The NPU consists of three compute subsystems relevant to software developers:

Neural Network Engine (NNE)

A parallel MAC array with multiple convolution cores responsible for convolution, depthwise convolution, and GEMM (fully-connected) operations. This is the primary accelerator for standard DNN layers. Supports INT8, INT16, FP16, and BF16.

Parallel Processing Unit (PPU)

A SIMD programmable execution unit that handles:

Pre- and post-processing kernels (OpenCL / OpenVX)
Custom NN layers not natively supported by the NNE
Activation, normalisation, reshape, and other lightweight operators
IEEE 32-bit floating-point pipeline

Vision Engine (EVIS)

Hardware-accelerated image processing primitives: 3x3 filtering, bilinear interpolation (Lerp), histogram, packed image load/store, and dot products. Used by the runtime driver for input format conversion.

The NPU communicates with the SoC via an AXI bus and supports virtual memory with 32-bit physical addressing.

Supported Quantisation Formats

Format	Description	When to use
`asymu8`	Asymmetric unsigned INT8	Maximum throughput; default for most models
`dfpi16`	Symmetric INT16 (Dynamic Fixed Point)	Higher accuracy requirements
`fp16`	IEEE 16-bit floating point	Mixed-precision or PPU-executed layers
`bfp16`	Brain floating-point 16-bit tensor format	Floating-point outputs decoded by SDK utils
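When writing custom post-processing, raw output tensors have to be converted back to real values. The sketch below shows the commonly used definitions of the two integer formats: asymmetric affine uint8 (real = scale * (q - zero_point)) and dynamic fixed point int16 (real = q * 2^-fl). These conventions are assumed here rather than taken from the text, the function and parameter names are hypothetical, and the actual scale, zero point and fractional length must come from the model's tensor metadata.

```c
#include <stdint.h>

/* Asymmetric affine uint8: real = scale * (q - zero_point). */
static float dequant_asymu8(uint8_t q, float scale, int zero_point)
{
    return scale * (float)((int)q - zero_point);
}

/* Dynamic fixed point int16: real = q * 2^(-fl), where fl is the
 * fractional length. A negative fl scales the value up instead.
 * fl is expected to lie in the range -31..31. */
static float dequant_dfpi16(int16_t q, int fl)
{
    if (fl >= 0) {
        return (float)q / (float)(1u << fl);
    }
    return (float)q * (float)(1u << -fl);
}
```

For example, a uint8 value of 130 with scale 0.5 and zero point 128 decodes to 1.0, and an int16 value of 256 with a fractional length of 8 also decodes to 1.0.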

Supported Input Image Formats

The NPU natively processes the following image formats without extra conversion cost:

Format	Description
`VX_DF_IMAGE_RGB`	24-bit RGB888, 3-channel interleaved (BT.709)
`VX_DF_IMAGE_NV12`	YUV 4:2:0 semi-planar - native ISP output format
`VX_DF_IMAGE_RGBX`	32-bit RGBX (R, G, B + don’t-care byte)
`VX_DF_IMAGE_U8`	Unsigned 8-bit single-channel
`VX_DF_IMAGE_S16`	Signed 16-bit single-channel

In this video pipeline, the V5 ISP channel outputs RGB888 frames that feed directly into the VIPNN module without conversion.
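The two formats used by the video pipeline differ in buffer size, which matters when budgeting frame buffers. The sketch below computes the packed size of one frame, assuming no row padding or stride alignment and, for NV12, even width and height. frame_format_t and frame_size_bytes are hypothetical names.

```c
#include <stddef.h>

typedef enum {
    FRAME_RGB888,  /* 3 bytes per pixel, interleaved */
    FRAME_NV12     /* full-resolution Y plane plus half-size interleaved UV plane */
} frame_format_t;

/* Packed size in bytes of one frame, with no stride padding. */
static size_t frame_size_bytes(frame_format_t fmt, int width, int height)
{
    size_t pixels = (size_t)width * (size_t)height;

    switch (fmt) {
    case FRAME_RGB888:
        return pixels * 3u;
    case FRAME_NV12:
        return pixels + pixels / 2u;
    }
    return 0;
}
```

A 576 x 320 frame occupies 552960 bytes as RGB888 and 276480 bytes as NV12, so NV12 halves the frame buffer size.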

Supported NN Layer Types

The NNE accelerates all standard DNN layer types. Custom or unsupported layers fall back to the PPU.

Category	Operations
Convolution	CONV3D, CONV2D, CONV1D, DECONVOLUTION, DECONVOLUTION1D, GROUPED_CONV2D, FCL2
Activation	RELU, LEAKY_RELU, PRELU, SIGMOID, TANH, SOFTMAX, LOG_SOFTMAX, SWISH, MISH, ELU, HARD_SIGMOID, CLIP, EXP, LOG, SQRT, RSQRT, ABS, NEG, LINEAR, SIN, ERF
Elementwise	ADD, SUBTRACT, MULTIPLY, DIVIDE, MAXIMUM, MINIMUM, POW, FLOORDIV, MATRIXMUL, RELATIONAL_OPS, LOGICAL_OPS, SELECT, ADDN
Normalisation	BATCH_NORM, LAYER_NORM, INSTANCE_NORM, GROUP_NORM, L2_NORMALIZE, MOMENTS
Reshape / Tensor	CONCAT, SLICE, SPLIT, RESHAPE, SQUEEZE, PERMUTE, PAD, REVERSE, SPACE2DEPTH, DEPTH2SPACE, BATCH2SPACE, SPACE2BATCH, STRIDED_SLICE, REDUCE, ARGMAX, ARGMIN, SHUFFLECHANNEL, RESIZE, EXPAND_BROADCAST
Recurrent (RNN)	LSTMUNIT, GRUCELL, GRU, SVDF
Pooling	MAX_POOL, AVG_POOL, ROI_POOL, POOLWITHARGMAX, UPSAMPLE
Miscellaneous	PROPOSAL, VARIABLE, DROPOUT, STACK, UNSTACK, REORG, GATHER, SCATTER_ND, ONE_HOT, CAST

Software Stack

The NPU runtime exposes the following APIs to application software:

OpenVX 1.3 + OpenVX 1.2 Neural Network Extension - primary NN inference API
OpenCL 3.0 / 1.2 Full Profile - for custom compute kernels on the PPU
Proprietary Extensions for CNN - vendor extensions for NN acceleration and custom layers
VIP Lite API (vip_lite.h) - low-level NPU control, used internally by vipnn_module

Neural network models trained in common AI frameworks (Keras, TensorFlow, TFLite, PyTorch, Caffe, ONNX, Darknet) are converted offline to a compiled network binary (.nb) using the Acuity Toolkit, then deployed at runtime from the LittleFS flash partition (vfs:) or SD card (sd:).

Custom Model Conversion

To convert a custom-trained model to the .nb format for use on the NPU, contact your Realtek representative or sales contact for assistance.