KWS (Keyword Spotting)

Supported ICs

Overview

KWS is the module that detects specific wakeup words in audio. It is usually the first step in a voice interaction system: after detecting the keyword, the device enters a state in which it waits for voice commands.

AIVoice provides two KWS solutions: a fixed keyword solution and a user-defined keyword solution. The former can achieve optimal performance on low-resource devices, while the latter allows flexible customization of keywords.

Solution               Training data       Available keywords                                       Feature
Fixed keyword          Specific keywords   Keywords same as training data                           Better performance, smaller model
User-defined keyword   Common data         Flexible keyword of the same language as training data   More flexible

Currently, the SDK provides a fixed keyword model library and a user-defined keyword model.

Fixed Keyword Model

  • Supports the Chinese keyword xiao-qiang-xiao-qiang or ni-hao-xiao-qiang.

  • Other keywords or performance optimizations can be provided through customized services.

User-defined Keyword Model

  • Language Support: Chinese only

  • Number of Keywords: Up to 5 keywords are supported simultaneously.

  • Word Length: Each keyword must contain 3 to 6 Chinese characters; words outside this range are invalid.

  • Keyword Selection Guidelines

    • Avoid characters with zero initials, i.e., syllables that have no initial consonant (e.g., yīn).

    • Avoid common daily phrases (e.g., put on clothes, eat breakfast).

    • Ensure high phonetic distinction between adjacent syllables.
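The count and length limits above can be checked before keywords are passed to the SDK. The following is a minimal sketch, not part of AIVoice: it assumes keywords are written as hyphen-separated toneless pinyin with one syllable per Chinese character (as in xiao-qiang-xiao-qiang), and all function and macro names are hypothetical. The zero-initial check is only a spelling heuristic based on the pinyin convention that such syllables start with a, o, e, y, or w.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Limits taken from the guidelines above. */
#define MAX_USER_KEYWORDS 5   /* up to 5 keywords at once */
#define MIN_SYLLABLES     3   /* 3 to 6 Chinese characters per keyword */
#define MAX_SYLLABLES     6

/* Count hyphen-separated syllables. Returns 0 for malformed input
 * (empty string, leading or trailing hyphen, or a doubled hyphen). */
static size_t count_syllables(const char *keyword)
{
    assert(keyword != NULL);
    size_t count = 0;
    bool in_syllable = false;
    for (const char *p = keyword; *p != '\0'; ++p) {
        if (*p == '-') {
            if (!in_syllable) return 0;   /* leading or doubled hyphen */
            in_syllable = false;
        } else if (!in_syllable) {
            in_syllable = true;
            ++count;
        }
    }
    return in_syllable ? count : 0;       /* trailing hyphen or empty */
}

/* Heuristic: a toneless pinyin syllable with a zero initial is spelled
 * starting with a, o, e, y, or w. */
static bool has_zero_initial_syllable(const char *keyword)
{
    assert(keyword != NULL);
    bool at_start = true;
    for (const char *p = keyword; *p != '\0'; ++p) {
        if (at_start && strchr("aoeyw", *p) != NULL) return true;
        at_start = (*p == '-');
    }
    return false;
}

/* True when the list respects the keyword count and per-keyword length limits. */
static bool keywords_are_valid(const char *const *keywords, size_t n)
{
    assert(keywords != NULL || n == 0);
    if (n == 0 || n > MAX_USER_KEYWORDS) return false;
    for (size_t i = 0; i < n; ++i) {
        size_t s = count_syllables(keywords[i]);
        if (s < MIN_SYLLABLES || s > MAX_SYLLABLES) return false;
    }
    return true;
}
```

A caller might reject a list when keywords_are_valid returns false and only log a warning when has_zero_initial_syllable returns true, since the latter is a selection guideline rather than a hard limit.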

KWS Mode

Two KWS modes are provided for different use cases. Single-channel mode takes single-channel audio as input, while multi-channel mode takes multi-channel audio as input. Compared to single-channel mode, multi-channel mode improves KWS and ASR accuracy, at the cost of higher computational resource consumption and memory usage.

KWS mode              Config                   Description
Single-channel mode   mode = KWS_SINGLE_MODE   Lower computational resource consumption and memory usage
Multi-channel mode    mode = KWS_MULTI_MODE    Better KWS and ASR accuracy

Algorithm Flow

  • Single-channel Mode

    [Figure: kws_single_channel]

  • Multi-channel Mode

    [Figure: kws_multi_channel]

Configurations

KWS configurable parameters

keywords:

Keywords used for wakeup. The available keywords depend on the KWS model. With the fixed keyword solution, keywords can only be chosen from the trained words. With the user-defined solution, keywords can be customized as any combination of units of the same language (such as pinyin for Chinese). Example: xiao-qiang-xiao-qiang.

thresholds:

Threshold for wakeup, in the range [0, 1]. A higher threshold produces fewer false alarms but makes the device harder to wake up. Set to 0 to use sensitivity with its predefined thresholds.

sensitivity:

Three sensitivity levels are provided, each with predefined thresholds. A higher sensitivity makes the device easier to wake up but also produces more false alarms. This parameter ONLY takes effect when thresholds is set to 0.

mode:

KWS mode, single-channel mode or multi-channel mode.

enable_age_gender:

Whether to output the speaker's age and gender classification on wakeup. Not supported in the current version.

Refer to ${aivoice_lib_dir}/include/aivoice_kws_config.h for details.

Threshold Adjustment Suggestions

  • As the threshold increases, both the wakeup rate and the number of false wakeups gradually decrease (i.e., sensitivity shifts from high to low). Users should select an appropriate threshold based on actual needs.

  • For the fixed keyword model, three sensitivity levels are provided: High, Medium, and Low, corresponding to approximately 1 false trigger per 12h, 24h, and 48h, respectively. For finer adjustments, users can configure the thresholds parameter to adapt to their usage scenario, with a suggested step size of 0.02.

  • For the user-defined keyword model, the thresholds are typically lower than those of the fixed keyword model, with a suggested adjustment step size of 0.005.

[Figure: kws_roc]
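The suggested step sizes can be applied with a small helper during tuning. The sketch below is an assumption-laden illustration rather than SDK code: the function and macro names are hypothetical, and the lower clamp is set to one step instead of 0 because a thresholds value of 0 switches the configuration to the sensitivity presets.

```c
#include <assert.h>

/* Suggested tuning steps from the adjustment suggestions above. */
#define FIXED_KEYWORD_STEP        0.02f
#define USER_DEFINED_KEYWORD_STEP 0.005f

/* Move a threshold by `steps` increments of `step`.
 * Positive steps raise the threshold (fewer false wakeups, lower wakeup rate);
 * negative steps lower it. The result is clamped to [step, 1] so that it never
 * reaches 0, which would mean "use sensitivity" instead of an explicit value. */
static float adjust_threshold(float current, float step, int steps)
{
    assert(step > 0.0f);
    float next = current + step * (float)steps;
    if (next < step) next = step;
    if (next > 1.0f) next = 1.0f;
    return next;
}
```

For example, after observing too many false wakeups with a fixed keyword model, a caller might raise the threshold with adjust_threshold(current, FIXED_KEYWORD_STEP, 1), re-measure, and repeat until the false wakeup rate is acceptable.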