AIVoice Overview
Supported ICs
- AFE single mic (ASR mode)
- AFE single mic (COM mode)
- AFE dual mic (ASR mode)
- KWS fixed keyword
- KWS user-defined keyword
- VAD
- ASR
Overview
AIVoice is an offline AI solution developed by Realtek that comprises local algorithm modules such as Audio Front End (signal processing), Keyword Spotting, Voice Activity Detection, and Automatic Speech Recognition. It can be used to build smart voice applications on Realtek Ameba SoCs.
AIVoice can be used as a purely offline solution on its own, or combined with cloud services such as speech recognition and LLMs to create a hybrid online/offline voice interaction solution.
Applications
Application solutions
Pure Offline: Standalone AIVoice usage supporting local wake-up, recognition, and other functions.
Offline-Online Hybrid: AIVoice integrated with cloud-based systems (e.g., speech recognition, LLM) for local wake-up followed by cloud interaction.
Application products
Smart Home: Smart speakers like Amazon Echo and Google Nest, or smart home appliances. Control lighting, temperature, and other smart devices through voice commands.
Smart Toys: AI story machines, educational robots, companion robots etc. These toys can engage in natural conversations with users, answering questions, telling stories, or providing bilingual education.
In-Car Systems: Enable drivers to navigate, make calls, and play music using voice commands, ensuring driving safety and improving the driving experience.
Wearable Products: Smartwatches, smart headphones, health monitoring devices, etc. Users can use voice control to check and send messages, control music playback, answer calls, and more.
Meeting Scenarios: Transcribe meeting content in real-time, helping participants better record and review discussion points.
Modules
| Modules | Functions |
|---|---|
| AFE (Audio Front End) | Enhances speech signals and reduces noise; includes the submodules AEC, Beamforming, NS, AGC, and SSL |
| KWS (Keyword Spotting) | Detects specific wakeup words to trigger voice assistants |
| VAD (Voice Activity Detection) | Detects speech segments and noise segments |
| ASR (Automatic Speech Recognition) | Detects offline voice control commands |
Flows
Several algorithm flows are provided to facilitate user development.
Full Flow: An offline full flow including AFE, KWS, and ASR. AFE and KWS are always on; ASR turns on when KWS detects the keyword and supports continuous recognition. ASR exits after a timeout.
AFE+KWS: Offline flow including AFE and KWS, always on.
AFE+KWS+VAD: Offline flow including AFE, KWS, and VAD. AFE and KWS are always on; VAD turns on when KWS detects the keyword and supports continuous activity detection. VAD exits after a timeout.
File Path
| Chip | OS | aivoice_lib_dir | aivoice_example_dir |
|---|---|---|---|
| RTL8730E | Linux | {LINUXSDK}/apps/aivoice | {LINUXSDK}/apps/aivoice/example |
| RTL8721Dx/RTL8730E | FreeRTOS | {RTOSSDK}/component/aivoice | {RTOSSDK}/component/example/aivoice |
| RTL8713E/RTL8726E | FreeRTOS | {DSPSDK}/lib/aivoice | {DSPSDK}/example/aivoice |
Interface
Flow and Module Interfaces
| Interface | Flow/Module |
|---|---|
| aivoice_iface_full_flow_v1 | AFE+KWS+ASR |
| aivoice_iface_afe_kws_v1 | AFE+KWS |
| aivoice_iface_afe_kws_vad_v1 | AFE+KWS+VAD |
| aivoice_iface_afe_v1 | AFE |
| aivoice_iface_vad_v1 | VAD |
| aivoice_iface_kws_v1 | KWS |
| aivoice_iface_asr_v1 | ASR |
All interfaces support the following functions:
- create()
- destroy()
- reset()
- feed()
Refer to ${aivoice_lib_dir}/include/aivoice_interface.h for details.
Event and Callback Message
| aivoice_out_event_type | Event trigger time | Callback message |
|---|---|---|
| AIVOICE_EVOUT_VAD | When VAD detects the start or end point of a speech segment | Struct including VAD status and offset |
| AIVOICE_EVOUT_WAKEUP | When KWS detects a keyword | JSON string including ID, keyword, and score. Example: {"id":2,"keyword":"ni-hao-xiao-qiang","score":0.9} |
| AIVOICE_EVOUT_ASR_RESULT | When ASR detects a command word | JSON string including FST type, commands, and IDs. Example: {"type":0,"commands":[{"rec":"play music","id":14}]} |
| AIVOICE_EVOUT_AFE | Every frame when AFE receives input | Struct including AFE output data, channel number, etc. |
| AIVOICE_EVOUT_ASR_REC_TIMEOUT | When ASR/VAD exceeds the timeout duration | NULL |
AFE Event Definition
struct aivoice_evout_afe {
int ch_num; /* channel number of output audio signal, default: 1 */
short* data; /* enhanced audio signal samples */
char* out_others_json; /* reserved for other output data, like flags, key: value */
};
VAD Event Definition
struct aivoice_evout_vad {
    int status;             /* 0: vad changed from speech to silence,
                                  indicating the end point of a speech segment
                               1: vad changed from silence to speech,
                                  indicating the start point of a speech segment */
    unsigned int offset_ms; /* time offset relative to the reset point */
};
Common Configurations
AIVoice configurable parameters:
- no_cmd_timeout:
In the full flow, ASR exits when no command word is detected within this duration. In the AFE+KWS+VAD flow, VAD works only within this duration after a keyword is detected.
- memory_alloc_mode:
The default mode uses the SDK default heap. SRAM mode also uses the SDK default heap but additionally allocates space from SRAM for memory-critical data. SRAM mode is currently ONLY available on the RTL8713E and RTL8726E DSP.
Refer to ${aivoice_lib_dir}/include/aivoice_sdk_config.h for details.
Example
AIVoice Offline Example: Full Flow with Pre-recorded Audio
This example shows how to use the AIVoice full flow with a pre-recorded 3-channel audio clip; it runs only once after EVB reset. Audio functions such as recording and playback are not integrated.
Example code is under ${aivoice_example_dir}/full_flow_offline.
Steps of Using AIVoice
Select the aivoice flow or modules needed.

/* step 1:
 * Select the aivoice flow you want to use.
 * Refer to the end of aivoice_interface.h to see which flows are supported.
 */
const struct rtk_aivoice_iface *aivoice = &aivoice_iface_full_flow_v1;
Build the configuration.

/* step 2:
 * Modify the default configuration if needed.
 * You can modify 0 or more configurations of afe/vad/kws/...
 */
struct aivoice_config config;
memset(&config, 0, sizeof(config));

/*
 * here we use afe_res_2mic50mm as an example.
 * you can change these configurations according to the afe resource you use.
 * refer to aivoice_afe_config.h for details;
 *
 * afe_config.mic_array MUST match the afe resource you linked.
 */
struct afe_config afe_param = AFE_CONFIG_ASR_DEFAULT_2MIC50MM; // change this according to the linked afe resource.
config.afe = &afe_param;

/*
 * ONLY turn on these settings when you are sure about what you are doing.
 * it is recommended to use the default configuration
 * if you do not know the meaning of these configuration parameters.
 */
struct vad_config vad_param = VAD_CONFIG_DEFAULT();
vad_param.left_margin = 300; // you can change the configuration if needed
config.vad = &vad_param; // can be NULL

struct kws_config kws_param = KWS_CONFIG_DEFAULT();
config.kws = &kws_param; // can be NULL

struct asr_config asr_param = ASR_CONFIG_DEFAULT();
config.asr = &asr_param; // can be NULL

struct aivoice_sdk_config aivoice_param = AIVOICE_SDK_CONFIG_DEFAULT();
aivoice_param.no_cmd_timeout = 10;
config.common = &aivoice_param; // can be NULL
Use create() to create and initialize the aivoice instance with the given configuration.

/* step 3:
 * Create the aivoice instance.
 */
void *handle = aivoice->create(&config);
if (!handle) {
    return;
}
Register a callback function.

/* step 4:
 * Register a callback function.
 * You may only receive some of the aivoice_out_event_type in this example,
 * depending on the flow you use.
 */
rtk_aivoice_register_callback(handle, aivoice_callback_process, NULL);
The callback function can be modified according to the use case:

static int aivoice_callback_process(void *userdata,
                                    enum aivoice_out_event_type event_type,
                                    const void *msg, int len)
{
    (void)userdata;
    struct aivoice_evout_vad *vad_out;
    struct aivoice_evout_afe *afe_out;

    switch (event_type) {
    case AIVOICE_EVOUT_VAD:
        vad_out = (struct aivoice_evout_vad *)msg;
        printf("[user] vad. status = %d, offset = %d\n", vad_out->status, vad_out->offset_ms);
        break;

    case AIVOICE_EVOUT_WAKEUP:
        printf("[user] wakeup. %.*s\n", len, (char *)msg);
        break;

    case AIVOICE_EVOUT_ASR_RESULT:
        printf("[user] asr. %.*s\n", len, (char *)msg);
        break;

    case AIVOICE_EVOUT_ASR_REC_TIMEOUT:
        printf("[user] asr timeout\n");
        break;

    case AIVOICE_EVOUT_AFE:
        afe_out = (struct aivoice_evout_afe *)msg;
        // afe will output audio each frame.
        // in this example, we only print it once to keep the log clear
        static int afe_out_printed = false;
        if (!afe_out_printed) {
            afe_out_printed = true;
            printf("[user] afe output %d channels raw audio, others: %s\n",
                   afe_out->ch_num,
                   afe_out->out_others_json ? afe_out->out_others_json : "null");
        }
        // process afe output raw audio as needed
        break;

    default:
        break;
    }

    return 0;
}
Use feed() to input audio data to aivoice.

/* when run on chips, we get an online audio stream;
 * here we use a fixed audio clip.
 */
const char *audio = (const char *)get_test_wav();
int len = get_test_wav_len();
int audio_offset = 44; /* skip the WAV header */
int mics_num = 2;
int afe_frame_bytes = (mics_num + afe_param.ref_num) * afe_param.frame_size * sizeof(short);
while (audio_offset <= len - afe_frame_bytes) {
    /* step 5:
     * Feed the audio to the aivoice instance.
     */
    aivoice->feed(handle, (char *)audio + audio_offset, afe_frame_bytes);
    audio_offset += afe_frame_bytes;
}
(Optional) Use reset() if a status reset is needed.

Use destroy() to destroy the instance when aivoice is no longer needed.

/* step 6:
 * Destroy the aivoice instance.
 */
aivoice->destroy(handle);
Build Example
Switch to GCC project directory in SDK
cd {SDK}/amebadplus_gcc_project
Run menuconfig.py to enter the configuration interface
./menuconfig.py
Navigate through the menu path to enable the TFLM library and AIVoice
--------MENUCONFIG FOR General---------
CONFIG TrustZone --->
...
CONFIG APPLICATION --->
    GUI Config --->
    ...
    AI Config --->
        [*] Enable TFLITE MICRO
        [*] Enable AIVoice
Build image
./build.py -a full_flow_offline
Build the TFLM library for DSP (refer to Build TFLM), or use the prebuilt TFLM library in {DSPSDK}/lib/aivoice/prebuilts.
Import the {DSPSDK}/example/aivoice/full_flow_offline source in Xtensa Xplorer. Set software configurations and modify libraries such as the AFE resource and KWS resource if needed.
Add include path (-I)
${workspace_loc}/../lib/aivoice/include
Add library search paths (-L)
${workspace_loc}/../lib/aivoice/prebuilts/$(TARGET_CONFIG)
${workspace_loc}/../lib/xa_nnlib/v1.8.1/bin/$(TARGET_CONFIG)/Release
${workspace_loc}/../lib/lib_hifi5/project/hifi5_library/bin/$(TARGET_CONFIG)/Release
${workspace_loc}/../lib/tflite_micro/project/bin/$(TARGET_CONFIG)/Release
Add libraries (-l)
-laivoice -lafe_kernel -lafe_res_2mic50mm -lkernel -lvad -lkws -lasr -lfst -lcJSON -ltomlc99 -ltflite_micro -lxa_nnlib -lhifi5_dsp
Build image, refer to the steps in DSP Build.
FreeRTOS
Switch to GCC project directory in SDK
cd {SDK}/amebasmart_gcc_project
Run menuconfig.py to enter the configuration interface
./menuconfig.py
Navigate through the menu path to enable the TFLM library and AIVoice
--------MENUCONFIG FOR General---------
CONFIG TrustZone --->
...
CONFIG APPLICATION --->
    GUI Config --->
    ...
    AI Config --->
        [*] Enable TFLITE MICRO
        [*] Enable AIVoice
Select the AFE resource according to the hardware; the default is afe_res_2mic50mm
AI Config --->
    [*] Enable TFLITE MICRO
    [*] Enable AIVoice
    Select AFE Resource
        ( ) afe_res_1mic
        ( ) afe_res_2mic30mm
        (X) afe_res_2mic50mm
        ( ) afe_res_2mic70mm
Select the KWS resource; the default is the fixed keywords xiao-qiang-xiao-qiang and ni-hao-xiao-qiang
AI Config --->
    [*] Enable TFLITE MICRO
    [*] Enable AIVoice
    Select AFE Resource
    Select KWS Resource
        (X) kws_res_xqxq
        ( ) kws_res_custom
Build image
./build.py -a full_flow_offline
Linux
(Optional) Modify the yocto recipe {LINUXSDK}/yocto/meta-realtek/meta-sdk/recipes-rtk/aivoice/rtk-aivoice-algo.bb to change libraries such as the AFE resource and KWS resource if needed.

Compile the aivoice algo image using bitbake:
bitbake rtk-aivoice-algo
Expected Result
Download the image to the EVB. After it runs, the log should display the algorithm results as follows:
[AIVOICE] set multi kws mode
---------------------SPEECH COMMANDS---------------------
Command ID1, 打开空调
Command ID2, 关闭空调
Command ID3, 制冷模式
Command ID4, 制热模式
Command ID5, 加热模式
Command ID6, 送风模式
Command ID7, 除湿模式
Command ID8, 调到十六度
Command ID9, 调到十七度
Command ID10, 调到十八度
Command ID11, 调到十九度
Command ID12, 调到二十度
Command ID13, 调到二十一度
Command ID14, 调到二十二度
Command ID15, 调到二十三度
Command ID16, 调到二十四度
Command ID17, 调到二十五度
Command ID18, 调到二十六度
Command ID19, 调到二十七度
Command ID20, 调到二十八度
Command ID21, 调到二十九度
Command ID22, 调到三十度
Command ID23, 开高一度
Command ID24, 开低一度
Command ID25, 高速风
Command ID26, 中速风
Command ID27, 低速风
Command ID28, 增大风速
Command ID29, 减小风速
Command ID30, 自动风
Command ID31, 最大风量
Command ID32, 中等风量
Command ID33, 最小风量
Command ID34, 自动风量
Command ID35, 左右摆风
Command ID36, 上下摆风
Command ID37, 播放音乐
Command ID38, 暂停播放
Command ID39, 接听电话
Command ID40, 挂断电话
---------------------------------------------------------
[AIVOICE] rtk_aivoice version: v1.5.0#S0825120#N1ed33d6#A6c25e38
[AIVOICE] rtk_aivoice_model afe version: afe_2mic_asr_v1.3.1_AfePara_2mic50_v2.0_bf_v0.0_20250401
[AIVOICE] rtk_aivoice_model vad version: vad_v7_opt
[AIVOICE] rtk_aivoice_model kws version: kws_xqxq_v4.1_opt
[AIVOICE] rtk_aivoice_model asr version: asr_cn_v8_opt
[AIVOICE] rtk_aivoice_log_format version: v2
[user] afe output 1 channels raw audio, others: {"abnormal_flag":0,"ssl_angle":-10}
[AIVOICE] [KWS] result: {"id":2,"keyword":"ni-hao-xiao-qiang","score":0.7746397852897644}
[user] wakeup. {"id":2,"keyword":"ni-hao-xiao-qiang","score":0.7746397852897644}
[user] voice angle 90.0
[user] vad. status = 1, offset = 385
[user] vad. status = 0, offset = 1865
[AIVOICE] [ASR] result: {"type":0,"commands":[{"rec":"打开空调","id":1}]}
[user] asr. {"type":0,"commands":[{"rec":"打开空调","id":1}]}
[user] voice angle 90.0
[user] vad. status = 1, offset = 525
[AIVOICE] [KWS] result: {"id":2,"keyword":"ni-hao-xiao-qiang","score":0.750707507133484}
[user] wakeup. {"id":2,"keyword":"ni-hao-xiao-qiang","score":0.750707507133484}
[user] voice angle 90.0
[user] vad. status = 1, offset = 445
[user] vad. status = 0, offset = 1765
[AIVOICE] [ASR] result: {"type":0,"commands":[{"rec":"播放音乐","id":37}]}
[user] asr. {"type":0,"commands":[{"rec":"播放音乐","id":37}]}
[user] voice angle 90.0
Glossary
- AEC
Acoustic Echo Cancellation, or echo cancellation, refers to removing the echo signal from the input signal. The echo signal is generated by a sound played through the device's speaker and then captured by its microphone.
- AFE
Audio Front End, refers to a combination of modules for preprocessing raw audio signals. It’s usually performed to improve the quality of speech signal before the voice interaction, including several speech enhancement algorithms.
- AGC
Automatic Gain Control, an algorithm that dynamically controls the gain of a signal, automatically adjusting the amplitude to maintain optimal signal strength.
- ASR
Automatic Speech Recognition, or Speech-to-Text, refers to recognizing spoken language in audio and transcribing it into text. It can be used to build voice user interfaces that enable spoken human interaction with AI devices.
- BF
BeamForming, refers to a spatial filter designed for a microphone array to enhance the signal from a specific direction and attenuate signals from other directions.
- KWS
Keyword Spotting, or wakeup word detection, refers to identifying specific keywords from audio. It is usually the first step in a voice interaction system. The device will enter the state of waiting voice commands after detecting the keyword.
- NN
Neural Network, a machine learning model used for various tasks in artificial intelligence. Neural networks rely on training data to learn and improve their accuracy.
- NS
Noise Suppression, or noise reduction, refers to suppressing ambient noises in the signal to enhance the speech signal, especially stationary noises.
- RES
Residual Echo Suppression, refers to suppressing the remained echo signal after AEC processing. It is a postfilter for AEC.
- SSL
Sound Source Localization, or direction of arrival (DOA), refers to estimating the spatial location of a sound source using a microphone array.
- TTS
Text-To-Speech, or speech synthesis, is a technology that converts text into spoken audio. It can be used in any speech-enabled application that requires converting text to speech imitating a human voice.
- VAD
Voice Activity Detection, or speech activity detection, is a binary classifier that detects the presence or absence of human speech. It is widely used in speech enhancement, ASR systems, etc., and can also be used to deactivate some processes during non-speech sections of an audio session, saving computation or bandwidth.