AIVoice Overview
Supported ICs
- AFE single mic (ASR mode)
- AFE single mic (COM mode)
- AFE dual mic (ASR mode)
- KWS fixed keyword
- KWS user-defined keyword
- VAD
- ASR
Overview
AIVoice is an offline AI solution developed by Realtek that comprises local algorithm modules such as Audio Front End (signal processing), Keyword Spotting, Voice Activity Detection, and Automatic Speech Recognition. It can be used to build smart voice applications on Realtek Ameba SoCs.
AIVoice can be used as a purely offline solution on its own, or combined with cloud services such as speech recognition and LLMs to create a hybrid online/offline voice interaction solution.
Applications
Application solutions
Pure Offline: Standalone AIVoice usage supporting local wake-up, recognition, and other functions.
Offline-Online Hybrid: AIVoice integrated with cloud-based systems (e.g., speech recognition, LLM) for local wake-up followed by cloud interaction.
Application products
Smart Home: Smart speakers like Amazon Echo and Google Nest, or smart home appliances. Control lighting, temperature, and other smart devices through voice commands.
Smart Toys: AI story machines, educational robots, companion robots etc. These toys can engage in natural conversations with users, answering questions, telling stories, or providing bilingual education.
In-Car Systems: Enable drivers to navigate, make calls, and play music using voice commands, ensuring driving safety and improving the driving experience.
Wearable Products: Smartwatches, smart headphones, health monitoring devices, etc. Users can use voice control to check and send messages, control music playback, answer calls, and more.
Meeting Scenarios: Transcribe meeting content in real-time, helping participants better record and review discussion points.
Modules
| Modules | Functions |
|---|---|
| AFE (Audio Front End) | Enhances speech signals and reduces noise; includes the submodules AEC, Beamforming, NS, AGC, and SSL |
| KWS (Keyword Spotting) | Detects specific wakeup words to trigger voice assistants |
| VAD (Voice Activity Detection) | Detects speech segments and noise segments |
| ASR (Automatic Speech Recognition) | Detects offline voice control commands |
Flows
Several algorithm flows are provided to facilitate user development.
Full Flow: An offline full flow including AFE, KWS, and ASR. AFE and KWS are always on; ASR turns on when KWS detects the keyword and supports continuous recognition. ASR exits after a timeout.
AFE+KWS: Offline flow including AFE and KWS, always on.
AFE+KWS+VAD: Offline flow including AFE, KWS, and VAD. AFE and KWS are always on; VAD turns on when KWS detects the keyword and supports continuous activity detection. VAD exits after a timeout.
File Path
| Chip | OS | aivoice_lib_dir | aivoice_example_dir |
|---|---|---|---|
| RTL8730E | Linux | {LINUXSDK}/apps/aivoice | {LINUXSDK}/apps/aivoice/example |
| RTL8721Dx/RTL8730E | FreeRTOS | {RTOSSDK}/component/aivoice | {RTOSSDK}/component/example/aivoice |
| RTL8713E/RTL8726E | FreeRTOS | {DSPSDK}/lib/aivoice | {DSPSDK}/example/aivoice |
Interface
Flow and Module Interfaces
| Interface | Flow/Module |
|---|---|
| aivoice_iface_full_flow_v1 | AFE+KWS+ASR |
| aivoice_iface_afe_kws_v1 | AFE+KWS |
| aivoice_iface_afe_kws_vad_v1 | AFE+KWS+VAD |
| aivoice_iface_afe_v1 | AFE |
| aivoice_iface_vad_v1 | VAD |
| aivoice_iface_kws_v1 | KWS |
| aivoice_iface_asr_v1 | ASR |
All interfaces support the following functions:
- create()
- destroy()
- reset()
- feed()
Refer to ${aivoice_lib_dir}/include/aivoice_interface.h for details.
Event and Callback Message
| aivoice_out_event_type | Event trigger time | Callback message |
|---|---|---|
| AIVOICE_EVOUT_VAD | When VAD detects the start or end point of a speech segment | Struct including VAD status and offset |
| AIVOICE_EVOUT_WAKEUP | When KWS detects a keyword | JSON string including ID, keyword, and score. Example: {"id":2,"keyword":"ni-hao-xiao-qiang","score":0.9} |
| AIVOICE_EVOUT_ASR_RESULT | When ASR detects a command word | JSON string including FST type, commands, and IDs. Example: {"type":0,"commands":[{"rec":"play music","id":14}]} |
| AIVOICE_EVOUT_AFE | Every frame when AFE receives input | Struct including AFE output data, channel number, etc. |
| AIVOICE_EVOUT_ASR_REC_TIMEOUT | When ASR/VAD exceeds the timeout duration | NULL |
AFE Event Definition
struct aivoice_evout_afe {
int ch_num; /* channel number of output audio signal, default: 1 */
short* data; /* enhanced audio signal samples */
char* out_others_json; /* reserved for other output data, like flags, key: value */
};
VAD Event Definition
struct aivoice_evout_vad {
    int status;             /* 0: vad changed from speech to silence,
                                  indicating the end point of a speech segment
                               1: vad changed from silence to speech,
                                  indicating the start point of a speech segment */
    unsigned int offset_ms; /* time offset relative to the reset point */
};
Common Configurations
AIVoice configurable parameters:
- no_cmd_timeout:
In the full flow, ASR exits when no command word is detected within this duration. In the AFE+KWS+VAD flow, VAD works only within this duration after a keyword is detected.
- memory_alloc_mode:
The default mode uses the SDK default heap. SRAM mode also uses the SDK default heap but additionally allocates space from SRAM for memory-critical data. SRAM mode is currently ONLY available on the RTL8713E and RTL8726E DSP.
Refer to ${aivoice_lib_dir}/include/aivoice_sdk_config.h for details.
Example
AIVoice Offline Example: Full Flow with Pre-recorded Audio
This example shows how to use the AIVoice full flow with a pre-recorded 3-channel audio clip; it runs only once after EVB reset. Audio functions such as recording and playback are not integrated.
Example code is under ${aivoice_example_dir}/full_flow_offline.
Steps of Using AIVoice
Select the aivoice flow or modules needed.

/* step 1:
 * Select the aivoice flow you want to use.
 * Refer to the end of aivoice_interface.h to see which flows are supported.
 */
const struct rtk_aivoice_iface *aivoice = &aivoice_iface_full_flow_v1;
Build the configuration.

/* step 2:
 * Modify the default configuration if needed.
 * You can modify 0 or more configurations of afe/vad/kws/...
 */
struct aivoice_config config;
memset(&config, 0, sizeof(config));

/*
 * here we use afe_res_2mic50mm as an example.
 * you can change these configurations according to the afe resource you use.
 * refer to aivoice_afe_config.h for details;
 *
 * afe_config.mic_array MUST match the afe resource you linked.
 */
struct afe_config afe_param = AFE_CONFIG_ASR_DEFAULT_2MIC50MM; // change this according to the linked afe resource.
config.afe = &afe_param;

/*
 * ONLY turn on these settings when you are sure about what you are doing.
 * it is recommended to use the default configuration
 * if you do not know the meaning of these configuration parameters.
 */
struct vad_config vad_param = VAD_CONFIG_DEFAULT();
vad_param.left_margin = 300; // you can change the configuration if needed
config.vad = &vad_param; // can be NULL

struct kws_config kws_param = KWS_CONFIG_DEFAULT();
config.kws = &kws_param; // can be NULL

struct asr_config asr_param = ASR_CONFIG_DEFAULT();
config.asr = &asr_param; // can be NULL

struct aivoice_sdk_config aivoice_param = AIVOICE_SDK_CONFIG_DEFAULT();
aivoice_param.no_cmd_timeout = 10;
config.common = &aivoice_param; // can be NULL
Use create() to create and initialize the aivoice instance with the given configuration.

/* step 3:
 * Create the aivoice instance.
 */
void *handle = aivoice->create(&config);
if (!handle) {
    return;
}
Register a callback function.

/* step 4:
 * Register a callback function.
 * You may only receive some of the aivoice_out_event_type in this example,
 * depending on the flow you use.
 */
rtk_aivoice_register_callback(handle, aivoice_callback_process, NULL);
The callback function can be modified according to the use case:

static int aivoice_callback_process(void *userdata,
                                    enum aivoice_out_event_type event_type,
                                    const void *msg, int len)
{
    (void)userdata;
    struct aivoice_evout_vad *vad_out;
    struct aivoice_evout_afe *afe_out;

    switch (event_type) {
    case AIVOICE_EVOUT_VAD:
        vad_out = (struct aivoice_evout_vad *)msg;
        printf("[user] vad. status = %d, offset = %d\n", vad_out->status, vad_out->offset_ms);
        break;

    case AIVOICE_EVOUT_WAKEUP:
        printf("[user] wakeup. %.*s\n", len, (char *)msg);
        break;

    case AIVOICE_EVOUT_ASR_RESULT:
        printf("[user] asr. %.*s\n", len, (char *)msg);
        break;

    case AIVOICE_EVOUT_ASR_REC_TIMEOUT:
        printf("[user] asr timeout\n");
        break;

    case AIVOICE_EVOUT_AFE:
        afe_out = (struct aivoice_evout_afe *)msg;
        // afe will output audio each frame.
        // in this example, we only print it once to keep the log clear
        static int afe_out_printed = false;
        if (!afe_out_printed) {
            afe_out_printed = true;
            printf("[user] afe output %d channels raw audio, others: %s\n",
                   afe_out->ch_num,
                   afe_out->out_others_json ? afe_out->out_others_json : "null");
        }
        // process afe output raw audio as needed
        break;

    default:
        break;
    }

    return 0;
}
Use feed() to input audio data to aivoice.

/* when run on chips, we get an online audio stream;
 * here we use a fixed audio clip.
 */
const char *audio = (const char *)get_test_wav();
int len = get_test_wav_len();
int audio_offset = 44; /* skip the WAV header */
int mics_num = 2;
int afe_frame_bytes = (mics_num + afe_param.ref_num) * afe_param.frame_size * sizeof(short);
while (audio_offset <= len - afe_frame_bytes) {
    /* step 5:
     * Feed the audio to the aivoice instance.
     */
    aivoice->feed(handle, (char *)audio + audio_offset, afe_frame_bytes);
    audio_offset += afe_frame_bytes;
}
(Optional) Use reset() if a status reset is needed.

Use destroy() to destroy the instance when aivoice is no longer needed.

/* step 6:
 * Destroy the aivoice instance.
 */
aivoice->destroy(handle);
Build Example
Switch to GCC project directory in SDK
cd {SDK}/amebadplus_gcc_project
Run menuconfig.py to enter the configuration interface
./menuconfig.py
Navigate through the menu path to enable the TFLM library and AIVoice
--------MENUCONFIG FOR General---------
CONFIG TrustZone --->
...
CONFIG APPLICATION --->
    GUI Config --->
    ...
    AI Config --->
        [*] Enable TFLITE MICRO
        [*] Enable AIVoice
Build image
./build.py -a full_flow_offline
Build the TFLM library for DSP (refer to Build TFLM), or use the prebuilt TFLM library in {DSPSDK}/lib/aivoice/prebuilts.
Import the {DSPSDK}/example/aivoice/full_flow_offline source in Xtensa Xplorer. Set software configurations and modify libraries such as the AFE resource and KWS resource if needed.
Add include path (-I)
${workspace_loc}/../lib/aivoice/include
Add library search paths (-L)
${workspace_loc}/../lib/aivoice/prebuilts/$(TARGET_CONFIG)
${workspace_loc}/../lib/xa_nnlib/v1.8.1/bin/$(TARGET_CONFIG)/Release
${workspace_loc}/../lib/lib_hifi5/project/hifi5_library/bin/$(TARGET_CONFIG)/Release
${workspace_loc}/../lib/tflite_micro/project/bin/$(TARGET_CONFIG)/Release
Add libraries (-l)
-laivoice -lafe_kernel -lafe_res_2mic50mm -lkernel -lvad -lkws -lasr -lfst -lcJSON -ltomlc99 -ltflite_micro -lxa_nnlib -lhifi5_dsp
Build image, refer to the steps in DSP Build.
FreeRTOS
Switch to GCC project directory in SDK
cd {SDK}/amebasmart_gcc_project
Run menuconfig.py to enter the configuration interface
./menuconfig.py
Navigate through the menu path to enable the TFLM library and AIVoice
--------MENUCONFIG FOR General---------
CONFIG TrustZone --->
...
CONFIG APPLICATION --->
    GUI Config --->
    ...
    AI Config --->
        [*] Enable TFLITE MICRO
        [*] Enable AIVoice
Select the AFE resource according to the hardware; the default is afe_res_2mic50mm
AI Config --->
    [*] Enable TFLITE MICRO
    [*] Enable AIVoice
    Select AFE Resource
        ( ) afe_res_1mic
        ( ) afe_res_2mic30mm
        (X) afe_res_2mic50mm
        ( ) afe_res_2mic70mm
Select the KWS resource; the default is the fixed keywords xiao-qiang-xiao-qiang and ni-hao-xiao-qiang
AI Config --->
    [*] Enable TFLITE MICRO
    [*] Enable AIVoice
    Select AFE Resource
    Select KWS Resource
        (X) kws_res_xqxq
        ( ) kws_res_custom
Build image
./build.py -a full_flow_offline
Linux
(Optional) Modify the yocto recipe {LINUXSDK}/yocto/meta-realtek/meta-sdk/recipes-rtk/aivoice/rtk-aivoice-algo.bb to change libraries such as the AFE resource and KWS resource if needed.

Compile the aivoice algo image using bitbake:
bitbake rtk-aivoice-algo
Expected Result
Download the image to the EVB. After it runs, the log should display the algorithm results as follows:
[AIVOICE] set multi kws mode
---------------------SPEECH COMMANDS---------------------
Command ID1, 打开空调
Command ID2, 关闭空调
Command ID3, 制冷模式
Command ID4, 制热模式
Command ID5, 加热模式
Command ID6, 送风模式
Command ID7, 除湿模式
Command ID8, 调到十六度
Command ID9, 调到十七度
Command ID10, 调到十八度
Command ID11, 调到十九度
Command ID12, 调到二十度
Command ID13, 调到二十一度
Command ID14, 调到二十二度
Command ID15, 调到二十三度
Command ID16, 调到二十四度
Command ID17, 调到二十五度
Command ID18, 调到二十六度
Command ID19, 调到二十七度
Command ID20, 调到二十八度
Command ID21, 调到二十九度
Command ID22, 调到三十度
Command ID23, 开高一度
Command ID24, 开低一度
Command ID25, 高速风
Command ID26, 中速风
Command ID27, 低速风
Command ID28, 增大风速
Command ID29, 减小风速
Command ID30, 自动风
Command ID31, 最大风量
Command ID32, 中等风量
Command ID33, 最小风量
Command ID34, 自动风量
Command ID35, 左右摆风
Command ID36, 上下摆风
Command ID37, 播放音乐
Command ID38, 暂停播放
Command ID39, 接听电话
Command ID40, 挂断电话
---------------------------------------------------------
[AIVOICE] rtk_aivoice version: v1.5.0#S0825120#N1ed33d6#A6c25e38
[AIVOICE] rtk_aivoice_model afe version: afe_2mic_asr_v1.3.1_AfePara_2mic50_v2.0_bf_v0.0_20250401
[AIVOICE] rtk_aivoice_model vad version: vad_v7_opt
[AIVOICE] rtk_aivoice_model kws version: kws_xqxq_v4.1_opt
[AIVOICE] rtk_aivoice_model asr version: asr_cn_v8_opt
[AIVOICE] rtk_aivoice_log_format version: v2
[user] afe output 1 channels raw audio, others: {"abnormal_flag":0,"ssl_angle":-10}
[AIVOICE] [KWS] result: {"id":2,"keyword":"ni-hao-xiao-qiang","score":0.7746397852897644}
[user] wakeup. {"id":2,"keyword":"ni-hao-xiao-qiang","score":0.7746397852897644}
[user] voice angle 90.0
[user] vad. status = 1, offset = 385
[user] vad. status = 0, offset = 1865
[AIVOICE] [ASR] result: {"type":0,"commands":[{"rec":"打开空调","id":1}]}
[user] asr. {"type":0,"commands":[{"rec":"打开空调","id":1}]}
[user] voice angle 90.0
[user] vad. status = 1, offset = 525
[AIVOICE] [KWS] result: {"id":2,"keyword":"ni-hao-xiao-qiang","score":0.750707507133484}
[user] wakeup. {"id":2,"keyword":"ni-hao-xiao-qiang","score":0.750707507133484}
[user] voice angle 90.0
[user] vad. status = 1, offset = 445
[user] vad. status = 0, offset = 1765
[AIVOICE] [ASR] result: {"type":0,"commands":[{"rec":"播放音乐","id":37}]}
[user] asr. {"type":0,"commands":[{"rec":"播放音乐","id":37}]}
[user] voice angle 90.0
Glossary
- AEC
Acoustic Echo Cancellation, or echo cancellation, refers to removing the echo signal from the input signal. The echo signal is generated by a sound played through the device's speaker and then captured by its microphone.
- AFE
Audio Front End, refers to a combination of modules for preprocessing raw audio signals. It’s usually performed to improve the quality of speech signal before the voice interaction, including several speech enhancement algorithms.
- AGC
Automatic Gain Control, an algorithm that dynamically controls the gain of a signal, automatically adjusting the amplitude to maintain optimal signal strength.
- ASR
Automatic Speech Recognition, or Speech-to-Text, refers to recognizing spoken language in audio and transcribing it into text. It can be used to build voice user interfaces that enable spoken human interaction with AI devices.
- BF
BeamForming, refers to a spatial filter designed for a microphone array to enhance the signal from a specific direction and attenuate signals from other directions.
- KWS
Keyword Spotting, or wakeup word detection, refers to identifying specific keywords from audio. It is usually the first step in a voice interaction system. The device will enter the state of waiting voice commands after detecting the keyword.
- NN
Neural Network, a machine learning model used for various tasks in artificial intelligence. Neural networks rely on training data to learn and improve their accuracy.
- NS
Noise Suppression, or noise reduction, refers to suppressing ambient noises in the signal to enhance the speech signal, especially stationary noises.
- RES
Residual Echo Suppression, refers to suppressing the remained echo signal after AEC processing. It is a postfilter for AEC.
- SSL
Sound Source Localization, or direction of arrival (DOA), refers to estimating the spatial location of a sound source using a microphone array.
- TTS
Text-To-Speech, or speech synthesis, is a technology that converts text into spoken audio. It can be used in any speech-enabled application that requires converting text to speech imitating a human voice.
- VAD
Voice Activity Detection, or speech activity detection, is a binary classifier that detects the presence or absence of human speech. It is widely used in speech enhancement, ASR systems, etc., and can also be used to deactivate some processes during non-speech sections of an audio session, saving computation or bandwidth.