AI NPU Module

Supported ICs: [RTL8735C]

Overview

The platform integrates a dedicated Neural Processing Unit (NPU) that offloads Deep Neural Network (DNN) computation from the main CPU, enabling real-time AI inference - such as object detection and face recognition - with low power consumption.

At INT8 precision, the NPU delivers approximately 1 TOPS at the 600 MHz operating frequency, backed by 256 KB on-chip SRAM (VIP_SRAM) for intermediate tensors and a dedicated 12 MB DDR region for model weights and I/O buffers (see NN Memory Layout).

For detailed hardware architecture, supported layer types, quantisation formats, and software stack, see NPU Hardware Reference.

Integration Overview

All NN examples follow the same integration pattern. The application opens a V5 video channel as an RGB (or NV12) source, passes frames to the VIPNN module for NPU inference, and receives structured results through a result callback function.

[Video V5: RGB NN_WIDTH x NN_HEIGHT @ fps] --> SISO --> [VIPNN module]
                                                              |
                                                       nn_display_cb()

Steps to get NN running:

  1. Select a pre-built model from the Pre-Built Model Library (or request a custom model - see Custom Model Conversion).

  2. Get the .nb model binary onto the device file system - see Deploying Models to Device.

  3. Configure the VIPNN module in your application - see VIPNN Module.

  4. Implement the result callback to act on inference output.

Pre-Built Model Library

Pre-compiled .nb model binaries are included in the SDK under the following directory:

component/soc/<soc>/video/nn/app/nn_model/binary/

To use a model, the .nb file must be present on the device file system before running the application. See Deploying Models to Device for how to flash or copy model files to the device.

Object Detection

YOLO Series

YOLO (You Only Look Once) is a widely used real-time object detection algorithm. The following variants are provided:

Model object   Binary filename                  Input size   Quantisation
-------------  -------------------------------  -----------  ------------
yolov4_tiny    yolov4_tiny_asymu8.nb            416 x 416    uint8
yolov7_tiny    yolov7_tiny_576x320_asymu8.nb    576 x 320    uint8
yolov7_tiny    yolov7_tiny_640x480_asymu8.nb    640 x 480    uint8
yolov9_tiny    yolov9_tiny_dfpi16.nb            416 x 416    int16 DFP

Include the corresponding header and select the model object at compile time:

#include "model_yolo.h"    // yolov4_tiny, yolov7_tiny
#include "model_yolov9.h"  // yolov9_tiny

#define NN_MODEL_OBJ    yolov7_tiny
#define NN_MODEL_NAME   "vfs:/yolov7_tiny_576x320_asymu8.nb"
#define NN_WIDTH        576
#define NN_HEIGHT       320

Each detected object is returned as an objdetect_res_t:

typedef struct objdetect_res_s {
    union {
        float result[6];   // [class_id, score, top_x, top_y, bot_x, bot_y]
        detobj_t res;
    };
} objdetect_res_t;

All coordinates are normalised to [0.0, 1.0] relative to the logical detector input size. In the common case this is the network tensor size. If the .nb model includes an in-graph preprocessing or scaling layer, the application can provide model_width / model_height in nn_data_param_t so YOLO post-processing decodes boxes against the logical detector size. The YOLO models are trained on the COCO dataset (80 classes).
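Because the coordinates are normalised, an application that draws boxes has to scale them back to pixels. The following is a minimal sketch of that step, not SDK code: det_sketch_t, pixel_box_t, and det_to_pixels() are hypothetical names that only mirror the float result[6] layout shown above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the result[6] layout described above:
 * [class_id, score, top_x, top_y, bot_x, bot_y], coordinates in [0.0, 1.0]. */
typedef struct {
    float result[6];
} det_sketch_t;

/* Pixel-space rectangle (illustrative helper type, not an SDK type). */
typedef struct {
    int x0, y0, x1, y1;
} pixel_box_t;

/* Clamp v into [lo, hi]. */
static float clampf(float v, float lo, float hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Scale one normalised detection to a frame of frame_w x frame_h pixels. */
pixel_box_t det_to_pixels(const det_sketch_t *det, int frame_w, int frame_h)
{
    assert(det != NULL && frame_w > 0 && frame_h > 0);

    pixel_box_t box;
    box.x0 = (int)(clampf(det->result[2], 0.0f, 1.0f) * (float)frame_w);
    box.y0 = (int)(clampf(det->result[3], 0.0f, 1.0f) * (float)frame_h);
    box.x1 = (int)(clampf(det->result[4], 0.0f, 1.0f) * (float)frame_w);
    box.y1 = (int)(clampf(det->result[5], 0.0f, 1.0f) * (float)frame_h);
    return box;
}
```

The bottom-right corner can equal the frame width or height when a coordinate is exactly 1.0, so a drawing routine may want to subtract one pixel before using it as an index.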

For more information: https://github.com/AlexeyAB/darknet

Face Detection

SCRFD

SCRFD (Sample and Computation Redistribution for Face Detection) is a lightweight, high-accuracy face detector that outputs bounding boxes and 5-point facial landmarks.

Model object   Binary filename                     Input size   Quantisation
-------------  ----------------------------------  -----------  ------------
scrfd          scrfd_500m_bnkps_shape576x320.nb    576 x 320    uint8

#include "model_scrfd.h"

#define NN_MODEL_OBJ    scrfd
#define NN_MODEL_NAME   "vfs:/scrfd_500m_bnkps_shape576x320.nb"
#define NN_WIDTH        576
#define NN_HEIGHT       320

The detection result is stored as facedetect_res_t:

typedef struct facedetect_res_s {
    union {
        float result[6];    // [class_id, score, top_x, top_y, bot_x, bot_y]
        detobj_t res;
    };
    landmark_t landmark;    // 5 facial landmark points (x, y) normalised to [0.0, 1.0]
} facedetect_res_t;

For more information: https://github.com/deepinsight/insightface/tree/master/detection/scrfd

Face Recognition

MobileFaceNet

MobileFaceNet is a compact face recognition model trained with ArcFace (Additive Angular Margin Loss). It takes a cropped and aligned face image and outputs a 128-dimensional feature embedding for identity matching.

MobileFaceNet is typically used together with SCRFD in a cascaded detect-then-recognise pipeline: SCRFD first detects and localises faces in the full frame, then MobileFaceNet extracts an embedding from each cropped face region for comparison against a stored identity database. See Cascaded Mode for the VIPNN configuration.

Model object     Binary filename              Input size   Quantisation
---------------  ---------------------------  -----------  ------------
mbfacenet_fwfs   mobilefacenet_pcqsymi8.nb    112 x 112    int8 sym

#include "model_mobilefacenet.h"

The recognition result is stored as face_feature_res_t:

#define MAX_FACE_FEATURE_DIM 128
typedef struct face_feature_res_s {
    union {
        float result[6];
        detobj_t res;
    };
    float feature[MAX_FACE_FEATURE_DIM];  // 128-dim face embedding
} face_feature_res_t;

For more information: https://github.com/deepinsight/insightface/tree/master/recognition

Model Memory and File Size Reference

The following table lists the memory footprint of each SDK model. The DDR memory column represents the NPU runtime memory (model weights + I/O tensor buffers). All models fit within the 12 MB NN_DDR window.

Category           Model binary                        Input size   Quantisation   DDR memory usage   File size
-----------------  ----------------------------------  -----------  -------------  -----------------  ---------
Object detection   yolov4_tiny_asymu8.nb               416 x 416    uint8          6.51 MB            3.59 MB
Object detection   yolov7_tiny_576x320_asymu8.nb       576 x 320    uint8          7.25 MB            3.96 MB
Object detection   yolov7_tiny_640x480_asymu8.nb       640 x 480    uint8          10.13 MB           3.83 MB
Object detection   yolov9_tiny_dfpi16.nb               416 x 416    int16 DFP      10.22 MB           4.73 MB
Face detection     scrfd_500m_bnkps_shape576x320.nb    576 x 320    uint8          2.28 MB            0.75 MB
Face recognition   mobilefacenet_pcqsymi8.nb           112 x 112    int8 sym       2.06 MB            1.40 MB

Note

When running two models simultaneously (e.g. SCRFD + MobileFaceNet for detect-then-recognise), ensure the combined DDR memory usage does not exceed the 12 MB NN_DDR budget. The SCRFD + MobileFaceNet combination uses approximately 4.3 MB in total.
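The budget check described in this note is simple addition, and it can be sketched as a small helper. This is an illustrative estimate only: nn_models_fit_ddr() is a hypothetical function, the figures are the table values expressed in hundredths of a MB, and the sketch assumes that the per-model numbers simply add up when both models are resident.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* NN_DDR window from this document: 12 MB, in hundredths of a MB. */
#define NN_DDR_BUDGET_MB_X100 1200u

/* Sum per-model DDR usage (hundredths of a MB) and compare with the budget. */
bool nn_models_fit_ddr(const unsigned *usage_mb_x100, size_t count)
{
    assert(usage_mb_x100 != NULL || count == 0);

    unsigned total = 0;
    for (size_t i = 0; i < count; i++) {
        total += usage_mb_x100[i];
    }
    return total <= NN_DDR_BUDGET_MB_X100;
}
```

With the table values, SCRFD (228) plus MobileFaceNet (206) passes the check, while SCRFD plus yolov7_tiny_640x480 (1013) does not.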


Deploying Models to Device

The VIPNN module loads model binaries at runtime using a path prefix:

  • vfs:/ - reads from the internal LittleFS flash partition (VFS1)

  • sd:/ - reads from an SD card

For quick prototyping, copy the .nb file to the root of an SD card and set the model path to sd:/model.nb; no additional build steps are required. For production, or for devices without an SD card slot, use the vfs: path described below.
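An application that accepts a model path from configuration may want to validate the prefix before handing it to the module. The sketch below only classifies the two prefixes listed above; model_store_t and model_path_store() are hypothetical names and do not describe how the VIPNN module parses the path internally.

```c
#include <assert.h>
#include <string.h>

/* Storage backends named in this section (illustrative enum, not an SDK type). */
typedef enum {
    MODEL_STORE_UNKNOWN = 0,
    MODEL_STORE_VFS,   /* "vfs:/" - internal LittleFS partition */
    MODEL_STORE_SD     /* "sd:/"  - SD card */
} model_store_t;

/* Classify a model path by its prefix. */
model_store_t model_path_store(const char *path)
{
    assert(path != NULL);

    if (strncmp(path, "vfs:/", 5) == 0) {
        return MODEL_STORE_VFS;
    }
    if (strncmp(path, "sd:/", 4) == 0) {
        return MODEL_STORE_SD;
    }
    return MODEL_STORE_UNKNOWN;
}
```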

Flashing Models to LittleFS (VFS1)

VFS1 is the LittleFS flash partition defined in the flash layout (component/soc/usrcfg/<soc>/ameba_flashcfg.c). The current SDK stores Wi-Fi, BT, and NN data in this single LittleFS region because only one LittleFS flash region is supported at runtime:

{VFS1, 0x088A3000, 0x08EA2FFF}   /* VFS region 1: wifi/BT/NN data (6 MB) */
{VFS2, 0xFFFFFFFF, 0xFFFFFFFF}

Note

The address range shown above is for reference only. Always verify the actual VFS1 partition address and size in component/soc/usrcfg/<soc>/ameba_flashcfg.c before flashing, as the layout may differ depending on your firmware configuration.
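When checking a flash layout entry, the partition size follows from its inclusive start and end addresses. The helpers below are a sketch with hypothetical names; the 4096-byte block size is taken from the mklittlefs command used in Step 2 and is assumed to apply to the partition as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Block size assumed from the "-b 4096" mklittlefs option in this section. */
#define LFS_BLOCK_SIZE 4096u

/* Size in bytes of a region given inclusive start and end addresses. */
uint32_t flash_region_size(uint32_t start, uint32_t end_inclusive)
{
    assert(end_inclusive >= start);
    return end_inclusive - start + 1u;
}

/* True when the region starts on a block boundary and spans whole blocks. */
bool flash_region_block_aligned(uint32_t start, uint32_t end_inclusive)
{
    return (start % LFS_BLOCK_SIZE == 0u) &&
           (flash_region_size(start, end_inclusive) % LFS_BLOCK_SIZE == 0u);
}
```

For the reference entry above, flash_region_size(0x088A3000, 0x08EA2FFF) evaluates to 0x600000, which is the 6 MB value passed to mklittlefs with -s.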

Step 1 - Prepare the model directory

Create a local directory under tools/littlefs/linux and copy the model files used by the SDK NN video examples into it:

cd tools/littlefs/linux
mkdir -p nn_model
cp ../../../component/soc/<soc>/video/nn/app/nn_model/binary/mobilefacenet_pcqsymi8.nb nn_model/
cp ../../../component/soc/<soc>/video/nn/app/nn_model/binary/scrfd_500m_bnkps_shape576x320.nb nn_model/
cp ../../../component/soc/<soc>/video/nn/app/nn_model/binary/yolov4_tiny_asymu8.nb nn_model/

These commands stage the three common example models, which Step 2 packs into one LittleFS image:

  • mobilefacenet_pcqsymi8.nb for MobileFaceNet face embedding

  • scrfd_500m_bnkps_shape576x320.nb for SCRFD face detection

  • yolov4_tiny_asymu8.nb for YOLO object detection

Step 2 - Build the LittleFS image

Run the following command from tools/littlefs/linux:

./mklittlefs -b 4096 -p 4096 -s 0x600000 -c nn_model/ nn_model_lfs.bin

After the image is generated, list the files inside the LittleFS image to verify that the expected model files were packed:

./mklittlefs -b 4096 -p 4096 -s 0x600000 -l nn_model_lfs.bin

Example output:

1466168    /mobilefacenet_pcqsymi8.nb             Mon Nov 17 08:57:48 2025
 787568    /scrfd_500m_bnkps_shape576x320.nb      Mon Nov 17 08:57:48 2025
3763536    /yolov4_tiny_asymu8.nb                 Mon Nov 17 08:57:48 2025

Option              Description
------------------  --------------------------------------------------------
-b 4096             Block size in bytes (matches the flash erase block size)
-p 4096             Page size in bytes
-s 0x600000         Image size - must match the VFS1 partition size (6 MB)
-c nn_model/        Input directory to pack
nn_model_lfs.bin    Output LittleFS image file

Step 3 - Flash the image

Flash nn_model_lfs.bin to the VFS1 start address (0x088A3000) using the image download tool. After a successful flash, the model files are accessible at runtime via the vfs: prefix:

#define FACEDET_MODEL_NAME   "vfs:/scrfd_500m_bnkps_shape576x320.nb"
#define FACENET_MODEL_NAME   "vfs:/mobilefacenet_pcqsymi8.nb"
#define OBJDET_MODEL_NAME    "vfs:/yolov4_tiny_asymu8.nb"

Note

The total size of all packed files must not exceed the VFS1 partition size (6 MB). The three-model example above is approximately 5.74 MB, which fits in the default 6 MB VFS1 image. Refer to Model Memory and File Size Reference for the file size of each pre-compiled model.
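Before building the image, the fit can be estimated from the file sizes alone. The sketch below rounds each file up to whole 4096-byte blocks and compares the total with the number of blocks in a 0x600000-byte image. It is a rough estimate under stated assumptions: lfs_blocks_needed() and lfs_payload_fits() are hypothetical helpers, and LittleFS metadata and directory overhead are ignored, so the real margin is somewhat smaller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LFS_BLOCK_SIZE  4096u      /* from the -b option */
#define LFS_IMAGE_SIZE  0x600000u  /* from the -s option (6 MB) */

/* Blocks needed when every file is rounded up to whole blocks. */
uint32_t lfs_blocks_needed(const uint32_t *file_sizes, size_t count)
{
    assert(file_sizes != NULL || count == 0);

    uint32_t blocks = 0;
    for (size_t i = 0; i < count; i++) {
        blocks += (file_sizes[i] + LFS_BLOCK_SIZE - 1u) / LFS_BLOCK_SIZE;
    }
    return blocks;
}

/* True when the rounded payload fits in the image (metadata not counted). */
bool lfs_payload_fits(const uint32_t *file_sizes, size_t count)
{
    return lfs_blocks_needed(file_sizes, count) <= LFS_IMAGE_SIZE / LFS_BLOCK_SIZE;
}
```

Using the byte counts from the example listing (1466168, 787568, 3763536), the estimate is 1470 of 1536 blocks, which leaves little room for a fourth model.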


VIPNN Module

The NN MMF module - vipnn - accepts RGB or NV12 frames from the video pipeline, runs inference on the NPU, and delivers structured post-processed results to the application via a callback function.

Pre-processing and post-processing are bundled with each model object (nnmodel_t), so adding a new model requires only providing a new model object - the VIPNN module itself does not need to change.

VIPNN Module Context

The internal context of the VIPNN module:

typedef struct vipnn_ctx_s {
    void *parent;
    vip_network network;                              // NPU network handle
    vip_buffer_create_params_t vip_param_in[MAX_IO_NUM];
    vip_buffer_create_params_t vip_param_out[MAX_IO_NUM];
    vip_buffer input_buffers[MAX_IO_NUM];
    vip_buffer output_buffers[MAX_IO_NUM];
    vipnn_params_t params;                            // module parameters
    vipnn_status_t status;
    char network_name[64];
    int input_count;
    int output_count;
    vipnn_preproc_t  pre_process;                    // custom pre-process hook
    vipnn_postproc_t post_process;                   // custom post-process hook
    disp_postprcess_t disp_postproc;                 // result display callback
    vipnn_cascaded_mode_t cas_mode;
    bool module_out_en;
    vipnn_measure_t measure;                         // inference FPS measurement
} vipnn_ctx_t;

Module Parameters

The vipnn_params_t structure holds the runtime parameters for the module:

typedef struct vipnn_param_s {
    char     model_file[64];    // model file path on file system (e.g. "vfs:/model.nb")
    uint8_t *model_mem;         // pointer to model in memory (alternative to file path)
    uint32_t model_size;        // model size in bytes (when using model_mem)
    int      fps;               // target inference FPS (0 = unlimited)
    int      out_res_size;      // sizeof one result structure
    int      out_res_max_cnt;   // maximum number of results per frame
    int      save_out_tensor;   // set to 1 to dump raw output tensors for offline debugging
    nn_data_param_t *in_param;  // input image parameters
    nnmodel_t       *model;     // pointer to the model object
} vipnn_params_t;

The image part of nn_data_param_t describes the frame consumed by VIPNN:

typedef struct nn_data_param_s {
    union {
        struct {
            int width, height;
            int model_width, model_height;  // optional logical detector size
            landmarki_t landmark;
        } img;
        /* audio fields omitted */
    };
    uint32_t codec_type;
    void *priv;
    int size_in_byte;
} nn_data_param_t;

Note

Set save_out_tensor = 1 to dump raw NPU output tensors to a file. This is useful when developing or verifying custom post-processing logic on a PC. Disable this flag in production builds.

When model_mem is set (non-NULL), the module loads the model from that memory pointer instead of the file system. This is useful for embedding the model binary directly into firmware rather than storing it in a separate file system partition.

Complete Module Initialisation

static nn_data_param_t nn_input_params = {
    .img = {
        .width  = NN_WIDTH,
        .height = NN_HEIGHT,
    },
    .codec_type = AV_CODEC_ID_RGB888
};

vipnn_ctx = mm_module_open(&vipnn_module);
if (vipnn_ctx) {
    mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_MODEL,           (int)&NN_MODEL_OBJ);
    mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_MODEL_FILE_NAME, (int)NN_MODEL_NAME);
    mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_IN_PARAMS,       (int)&nn_input_params);
    mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_DISPPOST,        (int)nn_display_cb);
    mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_RES_SIZE,        sizeof(objdetect_res_t));
    mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_RES_MAX_CNT,     32);
    mm_module_ctrl(vipnn_ctx, CMD_VIPNN_APPLY,               0);
}

Setting the Input Image Parameters

Use CMD_VIPNN_SET_IN_PARAMS to describe the input frame passed to the VIPNN module:

nn_data_param_t nn_input_params = {
    .img = {
        .width      = NN_WIDTH,   // incoming RGB/NV12 frame width
        .height     = NN_HEIGHT,  // incoming RGB/NV12 frame height
    },
    .codec_type = AV_CODEC_ID_RGB888   // or AV_CODEC_ID_NV12
};
mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_IN_PARAMS, (int)&nn_input_params);

For models that include an in-graph preprocessing or scaling layer, the incoming V5 RGB frame size can differ from the logical detector input size. In that case, keep width / height equal to the real input frame and set model_width / model_height to the detector size used by post-processing:

#define V5_RGB_WIDTH   1280
#define V5_RGB_HEIGHT  720
#define NN_WIDTH       416
#define NN_HEIGHT      416

nn_data_param_t nn_input_params = {
    .img = {
        .width = V5_RGB_WIDTH,
        .height = V5_RGB_HEIGHT,
        .model_width = NN_WIDTH,
        .model_height = NN_HEIGHT,
    },
    .codec_type = AV_CODEC_ID_RGB888
};

Note

The codec_type must match the output format of the upstream V5 video module. Use VIDEO_RGB + AV_CODEC_ID_RGB888 for models that require an RGB input. width and height describe the full incoming frame; ROI is no longer configured in nn_data_param_t. If the frame size differs from the network tensor size and the model does not contain its own preprocessing layer, the model preprocessing code resizes the full frame before inference.
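The relationship between the frame size and the logical detector size can be summarised as a fallback rule. The sketch below is an assumption-laden illustration, not the SDK's implementation: img_param_sketch_t mirrors only the four integer fields shown above, logical_detector_size() is a hypothetical helper, and it assumes that a zero model_width / model_height means "not set".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the size fields in nn_data_param_t.img. */
typedef struct {
    int width, height;              /* incoming frame size */
    int model_width, model_height;  /* optional logical detector size */
} img_param_sketch_t;

typedef struct {
    int width, height;
} det_size_t;

/* Use model_width/model_height when both are set, else the frame size.
 * Assumption: zero (or negative) values mean "not provided". */
det_size_t logical_detector_size(const img_param_sketch_t *img)
{
    assert(img != NULL);

    det_size_t size;
    if (img->model_width > 0 && img->model_height > 0) {
        size.width  = img->model_width;
        size.height = img->model_height;
    } else {
        size.width  = img->width;
        size.height = img->height;
    }
    return size;
}
```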

Setting the NN Model

Each supported model is represented by an nnmodel_t object that bundles the model binary path, pre-processing, and post-processing functions together.

#include "model_yolo.h"
#define NN_MODEL_OBJ    yolov4_tiny
#define NN_MODEL_NAME   "vfs:/yolov4_tiny_asymu8.nb"

mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_MODEL,           (int)&NN_MODEL_OBJ);
mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_MODEL_FILE_NAME, (int)NN_MODEL_NAME);

Setting the Result Callback

Register a callback with CMD_VIPNN_SET_DISPPOST to receive inference results after each frame. The callback runs in the VIPNN task context - keep it short and non-blocking:

static void nn_display_cb(void *p, void *img_param)
{
    vipnn_out_buf_t *out = (vipnn_out_buf_t *)p;
    objdetect_res_t *res = (objdetect_res_t *)&out->res[0];
    int obj_num = out->res_cnt;

    for (int i = 0; i < obj_num; i++) {
        RTK_LOGI(TAG, "class=%d score=%.2f [%.2f %.2f %.2f %.2f]\r\n",
                 (int)res[i].result[0], res[i].result[1],
                 res[i].result[2], res[i].result[3],
                 res[i].result[4], res[i].result[5]);
    }
}

mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_DISPPOST, (int)nn_display_cb);

Setting Detection Thresholds

For object detection and face detection models, two post-processing thresholds control result filtering:

static float nn_confidence_thresh = 0.5;   // minimum score to keep a detection
static float nn_nms_thresh        = 0.3;   // IoU threshold for NMS suppression

mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_CONFIDENCE_THRES, (int)&nn_confidence_thresh);
mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_NMS_THRES,        (int)&nn_nms_thresh);

Increasing nn_confidence_thresh reduces false positives but may also discard valid detections that score low. Increasing nn_nms_thresh allows detections with higher bounding-box overlap to coexist.
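To make the effect of the two thresholds concrete, the following sketch implements a generic greedy NMS over boxes that are already sorted by descending score. It is not the SDK's post-processing code: nms_box_t, box_iou(), and nms_sorted() are hypothetical names, and the sketch ignores class IDs, whereas a real detector may run NMS per class.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative box type: score plus normalised corner coordinates. */
typedef struct {
    float score;
    float x0, y0, x1, y1;
} nms_box_t;

/* Length of the overlap between intervals [a0, a1] and [b0, b1]. */
static float overlap_1d(float a0, float a1, float b0, float b1)
{
    float lo = a0 > b0 ? a0 : b0;
    float hi = a1 < b1 ? a1 : b1;
    return hi > lo ? hi - lo : 0.0f;
}

/* Intersection over union of two boxes; 0 when they do not overlap. */
float box_iou(const nms_box_t *a, const nms_box_t *b)
{
    float inter = overlap_1d(a->x0, a->x1, b->x0, b->x1) *
                  overlap_1d(a->y0, a->y1, b->y0, b->y1);
    float area_a = (a->x1 - a->x0) * (a->y1 - a->y0);
    float area_b = (b->x1 - b->x0) * (b->y1 - b->y0);
    float uni = area_a + area_b - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

/* Greedy NMS on boxes sorted by descending score. Boxes scoring below
 * conf_thresh are dropped; a box is suppressed when its IoU with any kept
 * box exceeds nms_thresh. Compacts in place and returns the kept count. */
size_t nms_sorted(nms_box_t *boxes, size_t count,
                  float conf_thresh, float nms_thresh)
{
    assert(boxes != NULL || count == 0);

    size_t kept = 0;
    for (size_t i = 0; i < count; i++) {
        if (boxes[i].score < conf_thresh) {
            continue;
        }
        int suppressed = 0;
        for (size_t k = 0; k < kept; k++) {
            if (box_iou(&boxes[i], &boxes[k]) > nms_thresh) {
                suppressed = 1;
                break;
            }
        }
        if (!suppressed) {
            boxes[kept++] = boxes[i];
        }
    }
    return kept;
}
```

Two boxes with an IoU of about 0.67 collapse to one at nms_thresh = 0.3 but both survive at nms_thresh = 0.7, which is the coexistence effect described above.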

Filtering by Class ID

Use CMD_VIPNN_SET_DESIRED_CLASS to restrict output to a specific object class ID. This is useful when running a multi-class model (such as YOLO trained on COCO) but the application only needs one class - for example, detecting only people (class 0 in COCO):

static int desired_class = 0;   // 0 = person in COCO
mm_module_ctrl(vipnn_ctx, CMD_VIPNN_SET_DESIRED_CLASS, (int)&desired_class);
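If an application needs more than one class, it can leave the module filter unset and filter inside the result callback instead. The sketch below shows the equivalent single-class filter in application code; det_sketch_t and keep_class() are hypothetical names that mirror the result[6] layout, with the class ID in result[0].

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the result[6] layout: result[0] is the class ID. */
typedef struct {
    float result[6];
} det_sketch_t;

/* Keep only detections of desired_class; compacts in place, returns count. */
size_t keep_class(det_sketch_t *res, size_t count, int desired_class)
{
    assert(res != NULL || count == 0);

    size_t kept = 0;
    for (size_t i = 0; i < count; i++) {
        if ((int)res[i].result[0] == desired_class) {
            res[kept++] = res[i];
        }
    }
    return kept;
}
```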

Cascaded Mode

Cascaded mode connects two VIPNN module instances in series. The result of a first-stage model (e.g. SCRFD face detection) is passed directly as input to a second-stage model (e.g. MobileFaceNet face recognition), enabling a detect-then-recognise pipeline without writing custom glue code between the two stages.

Set cas_mode on the downstream VIPNN module to enable it as a cascaded consumer. The upstream frame size and optional model_width / model_height are propagated to the cascaded input. MobileFaceNet now derives its face crop ROI from the previous SCRFD detection result inside model preprocessing, while landmarks are carried in nn_data_param_t.img.landmark for face alignment. Refer to the face recognition example in the SDK for the complete two-module setup.

The SDK face recognition example uses these additional VIPNN controls:

mm_module_ctrl(facedet_ctx, CMD_VIPNN_SET_OUTPUT, 1);
mm_module_ctrl(facedet_ctx, MM_CMD_SET_DATAGROUP, MM_GROUP_START);

mm_module_ctrl(facenet_ctx, CMD_VIPNN_SET_CASCADE, VIPNN_CMODE_ALL_ROI);
mm_module_ctrl(facenet_ctx, CMD_VIPNN_SET_OUTPUT, 1);
mm_module_ctrl(facenet_ctx, MM_CMD_SET_DATAGROUP, MM_GROUP_END);

CMD_VIPNN_SET_OUTPUT lets the first-stage result continue downstream. VIPNN_CMODE_ALL_ROI runs MobileFaceNet once for each detected face ROI instead of only the first ROI.

facerecog Module

facerecog_module consumes MobileFaceNet feature results, compares them with a registered identity database, and calls an application-provided draw callback with names and bounding boxes. It is compiled into the MMF module list as module_facerecog.c and is used by mmf2_video_example_nn_face_recognition_init.c.

The module stores up to MAX_FRC_REG_NUM (20) registered identities in RAM. CMD_FRC_SAVE_FEATURES writes them to vfs:/face_feature.bin with a CRC; CMD_FRC_LOAD_FEATURES reloads the file at runtime.

facerecog_ctx = mm_module_open(&facerecog_module);
mm_module_ctrl(facerecog_ctx, CMD_FRC_SET_THRES100, 99);
mm_module_ctrl(facerecog_ctx, CMD_FRC_SET_OSD_DRAW, (int)face_recognition_draw_object);
mm_module_ctrl(facerecog_ctx, CMD_FRC_LOAD_FEATURES, 0);

Command                     Description
--------------------------  -----------------------------------------------------------------
CMD_FRC_SET_THRES100        Set similarity threshold as an integer percentage. 99 means 0.99.
CMD_FRC_SET_OSD_DRAW        Register the draw callback that receives frc_draw_t.
CMD_FRC_REGISTER_MODE       Register the next single detected face under the supplied name.
CMD_FRC_RECOGNITION_MODE    Return to recognition mode.
CMD_FRC_LOAD_FEATURES       Load registered features from vfs:/face_feature.bin.
CMD_FRC_SAVE_FEATURES       Save registered features to vfs:/face_feature.bin.
CMD_FRC_RESET_FEATURES      Clear registered features in RAM.
CMD_FRC_LIST_FEATURES       Print registered identity names.
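The integer threshold passed to CMD_FRC_SET_THRES100 can be illustrated with a small matching helper. This document does not state which similarity metric the module uses, so the sketch assumes cosine similarity between two 128-dimensional embeddings; face_features_match() is a hypothetical function, not part of facerecog_module. The comparison is done on squared values, which avoids a square root.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FEATURE_DIM 128  /* embedding length given in this document */

/* True when cosine similarity of a and b is >= thres100 / 100.
 * Assumption: cosine similarity is the metric; the SDK may differ. */
bool face_features_match(const float *a, const float *b, int thres100)
{
    assert(a != NULL && b != NULL);
    assert(thres100 >= 0 && thres100 <= 100);

    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (size_t i = 0; i < FEATURE_DIM; i++) {
        dot    += (double)a[i] * (double)b[i];
        norm_a += (double)a[i] * (double)a[i];
        norm_b += (double)b[i] * (double)b[i];
    }
    if (norm_a <= 0.0 || norm_b <= 0.0 || dot < 0.0) {
        return false;  /* zero vector or opposite direction */
    }

    /* With dot >= 0: dot / sqrt(na * nb) >= t  <=>  dot^2 >= t^2 * na * nb. */
    double t = (double)thres100 / 100.0;
    return dot * dot >= t * t * norm_a * norm_b;
}
```

Because the test is scale-invariant, embeddings do not need to be L2-normalised before comparison.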

Module Command Reference

Command                           Description
--------------------------------  ------------------------------------------------------------------------
CMD_VIPNN_SET_MODEL               Set the model object (nnmodel_t *)
CMD_VIPNN_SET_MODEL_FILE_NAME     Set the model binary file path string (e.g. "vfs:/model.nb")
CMD_VIPNN_SET_IN_PARAMS           Set input frame descriptor (nn_data_param_t *)
CMD_VIPNN_SET_DISPPOST            Register result callback (disp_postprcess_t)
CMD_VIPNN_SET_RES_SIZE            Set sizeof one result structure
CMD_VIPNN_SET_RES_MAX_CNT         Set maximum number of results per frame
CMD_VIPNN_SET_CONFIDENCE_THRES    Set detection confidence threshold (float *)
CMD_VIPNN_SET_NMS_THRES           Set NMS IoU threshold (float *)
CMD_VIPNN_SET_DESIRED_CLASS       Filter output to a specific object class ID (int *)
CMD_VIPNN_SET_OUTPUT              Enable module output so downstream MMF modules can consume VIPNN results
CMD_VIPNN_SET_OUTPUT_TYPE         Select normal or raw VIPNN output
CMD_VIPNN_SET_CASCADE             Enable cascaded mode (VIPNN_CMODE_ONE_ROI or VIPNN_CMODE_ALL_ROI)
CMD_VIPNN_SET_SAVE_OUT_TENSOR     Store output tensors for debugging custom post-processing
CMD_VIPNN_SET_USR_OUTPUT_BUF      Use an application-provided output buffer
CMD_VIPNN_APPLY                   Apply configuration and start the VIPNN module


For NN media example usage, see Media Example.

NN Memory Layout

The NPU uses a dedicated region in DDR for model weights, input/output tensors, and intermediate computation buffers. The NN DDR window is defined in the linker script ameba_layout.ld:

NN_DDR (rwx) : ORIGIN = 0x87400000, LENGTH = 12M

Address range: 0x87400000 - 0x87FFFFFF  (12 MB)

The complete DDR memory map for reference:

Region    Base          Size      Usage
--------  ------------  --------  -------------------------------------
AP_DDR    0x80420000    ~28 MB    Application code, heap, stack
EN_DDR    0x82000000    32 MB     Encoder (H.264/HEVC) working memory
VP_DDR    0x84000000    48 MB     Video processor (ISP) working memory
TG_DDR    0x87000000    4 MB      Tile scaler / graphics engine
NN_DDR    0x87400000    12 MB     NPU model + tensor buffers
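The base addresses and sizes in the map can be cross-checked mechanically. The sketch below encodes the rows as a table and verifies that each region ends where the next begins. The ddr_region_t type and both helpers are hypothetical, and the exact AP_DDR size (0x01BE0000) is an assumption derived from the gap between its base and the EN_DDR base, since the document only gives "~28 MB".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One row of the DDR map (illustrative type, not from the SDK). */
typedef struct {
    const char *name;
    uint32_t base;
    uint32_t size;
} ddr_region_t;

const ddr_region_t ddr_map[] = {
    {"AP_DDR", 0x80420000u, 0x01BE0000u},  /* assumed: 0x82000000 - base */
    {"EN_DDR", 0x82000000u, 32u << 20},
    {"VP_DDR", 0x84000000u, 48u << 20},
    {"TG_DDR", 0x87000000u, 4u << 20},
    {"NN_DDR", 0x87400000u, 12u << 20},
};

/* Last valid address of a region. */
uint32_t ddr_region_end(const ddr_region_t *r)
{
    assert(r != NULL && r->size > 0u);
    return r->base + r->size - 1u;
}

/* True when every region starts exactly where the previous one ends. */
bool ddr_map_is_contiguous(const ddr_region_t *map, size_t count)
{
    assert(map != NULL || count == 0);
    for (size_t i = 1; i < count; i++) {
        if (map[i - 1].base + map[i - 1].size != map[i].base) {
            return false;
        }
    }
    return true;
}
```

The last NN_DDR address computed this way is 0x87FFFFFF, matching the address range quoted from the linker script.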


NPU Hardware Reference

Compute Performance

Data type      MACs / cycle           Notes
-------------  ---------------------  ----------------------------------------
INT8           768 MACs -> ~1 TOPS    Default; best throughput
INT16 (DFP)    192 MACs               Higher numerical precision
FP16 / BF16    384 MACs               Floating-point; used for PPU-side layers
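The "~1 TOPS" figure can be reproduced from the MACs-per-cycle value and the 600 MHz clock stated in the Overview. The sketch assumes the common convention that one multiply-accumulate counts as two operations; npu_peak_tops() is a hypothetical helper, and the result is a theoretical peak rather than a measured throughput.

```c
#include <assert.h>

/* Theoretical peak in TOPS, assuming 1 MAC = 2 operations. */
double npu_peak_tops(unsigned macs_per_cycle, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)macs_per_cycle * 2.0 * clock_hz / 1e12;
}
```

For INT8, 768 MACs x 2 x 600 MHz gives roughly 0.92 TOPS, which the document rounds to about 1 TOPS. The same formula gives roughly 0.23 TOPS for INT16 and 0.46 TOPS for FP16 / BF16.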

NPU Architecture

The NPU consists of three compute subsystems relevant to software developers:

Neural Network Engine (NNE)

A parallel MAC array with multiple convolution cores responsible for convolution, depthwise convolution, and GEMM (fully-connected) operations. This is the primary accelerator for standard DNN layers. Supports INT8, INT16, FP16, and BF16.

Parallel Processing Unit (PPU)

A SIMD programmable execution unit that handles:

  • Pre- and post-processing kernels (OpenCL / OpenVX)

  • Custom NN layers not natively supported by the NNE

  • Activation, normalisation, reshape, and other lightweight operators

  • IEEE 32-bit floating-point pipeline

Vision Engine (EVIS)

Hardware-accelerated image processing primitives: 3x3 filtering, bilinear interpolation (Lerp), histogram, packed image load/store, and dot products. Used by the runtime driver for input format conversion.

The NPU communicates with the SoC via an AXI bus and supports virtual memory with 32-bit physical addressing.

Supported Quantisation Formats

Format    Description                                  When to use
--------  -------------------------------------------  --------------------------------------------
asymu8    Asymmetric unsigned INT8                     Maximum throughput; default for most models
dfpi16    Symmetric INT16 (Dynamic Fixed Point)        Higher accuracy requirements
fp16      IEEE 16-bit floating point                   Mixed-precision or PPU-executed layers
bfp16     Brain floating-point 16-bit tensor format    Floating-point outputs decoded by SDK utils
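When writing custom post-processing, raw output tensors in these formats have to be converted to real values. The helpers below sketch the usual conversions: affine dequantisation for asymmetric uint8, a power-of-two shift for dynamic fixed point, and bit extension for bfloat16. The function names are hypothetical, and the scale, zero_point, and fl (fractional length) parameters are assumed to come from the model's tensor metadata; the SDK's own utilities may be organised differently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Asymmetric uint8: real = scale * (q - zero_point). */
float dequant_asymu8(uint8_t q, float scale, int zero_point)
{
    return scale * (float)((int)q - zero_point);
}

/* Dynamic fixed point int16: real = q * 2^(-fl), fl = fractional length. */
float dequant_dfpi16(int16_t q, int fl)
{
    assert(fl > -31 && fl < 31);
    if (fl >= 0) {
        return (float)q / (float)(1u << fl);
    }
    return (float)q * (float)(1u << -fl);
}

/* bfloat16 is the upper 16 bits of an IEEE 754 binary32 value. */
float bf16_to_float(uint16_t bits)
{
    uint32_t word = (uint32_t)bits << 16;
    float value;
    memcpy(&value, &word, sizeof value);  /* bit copy, no numeric conversion */
    return value;
}
```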

Supported Input Image Formats

The NPU natively processes the following image formats without extra conversion cost:

Format              Description
------------------  ------------------------------------------------
VX_DF_IMAGE_RGB     24-bit RGB888, 3-channel interleaved (BT.709)
VX_DF_IMAGE_NV12    YUV 4:2:0 semi-planar - native ISP output format
VX_DF_IMAGE_RGBX    32-bit RGBX (R, G, B + don’t-care byte)
VX_DF_IMAGE_U8      Unsigned 8-bit single-channel
VX_DF_IMAGE_S16     Signed 16-bit single-channel

In this video pipeline, the V5 ISP channel outputs RGB888 frames that feed directly into the VIPNN module without conversion.

Supported NN Layer Types

The NNE accelerates all standard DNN layer types. Custom or unsupported layers fall back to the PPU.

Category            Operations
------------------  ----------
Convolution         CONV3D, CONV2D, CONV1D, DECONVOLUTION, DECONVOLUTION1D, GROUPED_CONV2D, FCL2
Activation          RELU, LEAKY_RELU, PRELU, SIGMOID, TANH, SOFTMAX, LOG_SOFTMAX, SWISH, MISH, ELU, HARD_SIGMOID, CLIP, EXP, LOG, SQRT, RSQRT, ABS, NEG, LINEAR, SIN, ERF
Elementwise         ADD, SUBTRACT, MULTIPLY, DIVIDE, MAXIMUM, MINIMUM, POW, FLOORDIV, MATRIXMUL, RELATIONAL_OPS, LOGICAL_OPS, SELECT, ADDN
Normalisation       BATCH_NORM, LAYER_NORM, INSTANCE_NORM, GROUP_NORM, L2_NORMALIZE, MOMENTS
Reshape / Tensor    CONCAT, SLICE, SPLIT, RESHAPE, SQUEEZE, PERMUTE, PAD, REVERSE, SPACE2DEPTH, DEPTH2SPACE, BATCH2SPACE, SPACE2BATCH, STRIDED_SLICE, REDUCE, ARGMAX, ARGMIN, SHUFFLECHANNEL, RESIZE, EXPAND_BROADCAST
Recurrent (RNN)     LSTMUNIT, GRUCELL, GRU, SVDF
Pooling             MAX_POOL, AVG_POOL, ROI_POOL, POOLWITHARGMAX, UPSAMPLE
Miscellaneous       PROPOSAL, VARIABLE, DROPOUT, STACK, UNSTACK, REORG, GATHER, SCATTER_ND, ONE_HOT, CAST

Software Stack

The NPU runtime exposes the following APIs to application software:

  • OpenVX 1.3 + OpenVX 1.2 Neural Network Extension - primary NN inference API

  • OpenCL 3.0 / 1.2 Full Profile - for custom compute kernels on the PPU

  • Proprietary Extensions for CNN - vendor extensions for NN acceleration and custom layers

  • VIP Lite API (vip_lite.h) - low-level NPU control, used internally by vipnn_module

Neural network models trained in common AI frameworks (Keras, TensorFlow, TFLite, PyTorch, Caffe, ONNX, Darknet) are converted offline to a compiled network binary (.nb) using the Acuity Toolkit, then deployed at runtime from the LittleFS flash partition (vfs:) or SD card (sd:).


Custom Model Conversion

If you have a custom-trained model and need to convert it to the .nb format for use on the NPU, please contact your Realtek representative or sales contact to request further assistance.