VAD (Voice Activity Detection)

Supported ICs: RTL8721Dx, RTL8726E, RTL8713E, RTL8730E

Overview

VAD is the module that detects the presence of human speech in audio.

AIVoice provides a neural-network-based VAD that can be used in speech enhancement, ASR systems, and similar applications.

Refer to Event and Callback Message to see VAD’s output.

Configurations

VAD configurable parameters

sensitivity:: Three levels of sensitivity are provided, each with a predefined threshold. A higher level makes speech easier to detect but also produces more false alarms.
left_margin:: Time margin added to the start of a speech segment, which makes the start offset earlier than the raw prediction. It only affects the offset_ms of the VAD output; it does not affect the event trigger time of status 1.
right_margin:: Time margin added to the end of a speech segment, which makes the end offset later than the raw prediction. It affects both the offset_ms of the VAD output and the event time of status 0.
min_speech_duration:: The minimum duration of a speech segment. A segment shorter than this value, measured before left_margin and right_margin are added, is discarded to reduce VAD false alarms.

Refer to ${aivoice_lib_dir}/include/aivoice_vad_config.h for details.
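
The interaction of the three timing parameters can be illustrated with a small model. This is a minimal sketch, not the AIVoice implementation: the struct and function names are hypothetical, all values are assumed to be in milliseconds, and sensitivity is left out because its thresholds are internal to the VAD. The actual definitions are in aivoice_vad_config.h and may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the documented parameters (assumed unit: ms). */
struct example_vad_params {
    uint32_t left_margin_ms;
    uint32_t right_margin_ms;
    uint32_t min_speech_duration_ms;
};

/* A speech segment as start/end offsets in the audio stream. */
struct example_segment {
    uint32_t start_ms;
    uint32_t end_ms;
};

/*
 * Applies the parameters to a raw VAD prediction.
 * Returns false if the raw segment is discarded, true if *out is filled.
 */
bool example_apply_vad_params(const struct example_vad_params *p,
                              struct example_segment raw,
                              struct example_segment *out)
{
    /* Reject malformed input so the subtraction below cannot underflow. */
    if (raw.end_ms < raw.start_ms) {
        return false;
    }
    /* The duration check uses the raw segment, before margins are added. */
    if (raw.end_ms - raw.start_ms < p->min_speech_duration_ms) {
        return false;
    }
    /* left_margin moves the start earlier, clamped at the stream start. */
    out->start_ms = (raw.start_ms > p->left_margin_ms)
                        ? raw.start_ms - p->left_margin_ms
                        : 0;
    /* right_margin moves the end later. */
    out->end_ms = raw.end_ms + p->right_margin_ms;
    return true;
}
```

With illustrative values of left_margin 300, right_margin 160, and min_speech_duration 200, a raw segment of 1000 to 1500 ms becomes 700 to 1660 ms, while a raw segment of 1000 to 1100 ms is discarded because it lasts only 100 ms.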

Note

left_margin only affects the offset_ms returned by VAD; it does not affect the VAD event trigger time. If you need the audio that falls within left_margin, implement a buffer that retains recent audio.
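
One way to retain that audio is a fixed-size ring buffer that always holds the most recent samples. The sketch below is an illustration under assumptions, not part of AIVoice: it assumes 16 kHz, 16-bit mono PCM and a left_margin of 300 ms, and all names are hypothetical. Whether extra capacity is needed beyond left_margin depends on how offset_ms relates to the event time in your setup, which this sketch does not model.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed audio format and margin; adjust to your configuration. */
#define EXAMPLE_SAMPLE_RATE_HZ 16000u
#define EXAMPLE_LEFT_MARGIN_MS 300u
#define EXAMPLE_HISTORY_SAMPLES \
    (EXAMPLE_SAMPLE_RATE_HZ / 1000u * EXAMPLE_LEFT_MARGIN_MS)

/* Ring buffer holding the most recent EXAMPLE_HISTORY_SAMPLES samples. */
struct example_history {
    int16_t data[EXAMPLE_HISTORY_SAMPLES];
    size_t next;  /* index where the next sample will be written */
    size_t count; /* number of valid samples, at most the capacity */
};

/* Appends n samples, overwriting the oldest ones once the buffer is full. */
void example_history_push(struct example_history *h,
                          const int16_t *pcm, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        h->data[h->next] = pcm[i];
        h->next = (h->next + 1) % EXAMPLE_HISTORY_SAMPLES;
        if (h->count < EXAMPLE_HISTORY_SAMPLES) {
            h->count++;
        }
    }
}

/*
 * Copies the retained samples into out, oldest first, and returns how many
 * were copied. out must have room for EXAMPLE_HISTORY_SAMPLES samples.
 */
size_t example_history_read(const struct example_history *h, int16_t *out)
{
    size_t start = (h->next + EXAMPLE_HISTORY_SAMPLES - h->count)
                   % EXAMPLE_HISTORY_SAMPLES;
    for (size_t i = 0; i < h->count; i++) {
        out[i] = h->data[(start + i) % EXAMPLE_HISTORY_SAMPLES];
    }
    return h->count;
}
```

A typical use would be to push every incoming audio frame into a zero-initialized example_history and, when the status 1 event arrives, read the buffer to recover the audio preceding the event.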

Suggestions for adjusting parameters

Suggestion for adjusting left_margin

A larger left_margin extends the VAD segment further to the left, so it retains more of the audio around the speech onset and reduces the chance that the beginning of the speech is clipped. However, a large left_margin is also more likely to include noise (background noise or irrelevant speech) and requires a larger audio buffer to be reserved.

Case 1: Properly increase left_margin to reduce the clipping of the front part of the speech

Case 2: Excessive increase in left_margin may introduce irrelevant speech

Suggestion for adjusting right_margin

A larger right_margin extends the VAD segment further to the right, so it retains more of the audio around the end of the speech and reduces the chance that the end of the speech is clipped. However, an overly large right_margin is more likely to include noise (background noise or irrelevant speech) and increases latency.

Case 1: Properly increase right_margin to reduce the clipping of the tail speech

Case 2: Excessive increase in right_margin may introduce irrelevant noise

Case 3: In long-sentence scenarios, increasing right_margin reduces the chance that a long sentence is split apart at pauses

In general, left_margin and right_margin should not be too large; set them just large enough to cover most speech segments. For long-sentence dialogue scenarios, increase right_margin so that the algorithm does not end the segment prematurely when the user pauses mid-sentence. Because a larger right_margin also increases latency, tune it against the latency your application can tolerate.
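
The trade-off between pause tolerance and latency can be made concrete with a simplified model. The sketch below assumes that a segment is only closed once the silence after it has lasted right_margin, so a pause shorter than right_margin keeps the segment open. This is an assumption made for illustration and not a description of the AIVoice internals; all names are hypothetical and values are in milliseconds.

```c
#include <stddef.h>
#include <stdint.h>

/* A raw speech span as start/end offsets (assumed unit: ms). */
struct example_span {
    uint32_t start_ms;
    uint32_t end_ms;
};

/*
 * Merges, in place, neighbouring spans whose separating pause is shorter
 * than right_margin_ms. The input must be sorted by start time and must not
 * overlap. Returns the number of spans after merging.
 */
size_t example_merge_pauses(struct example_span *spans, size_t n,
                            uint32_t right_margin_ms)
{
    if (n == 0) {
        return 0;
    }
    size_t last = 0; /* index of the span currently being extended */
    for (size_t i = 1; i < n; i++) {
        uint32_t pause = spans[i].start_ms - spans[last].end_ms;
        if (pause < right_margin_ms) {
            /* Pause is bridged: extend the current span. */
            spans[last].end_ms = spans[i].end_ms;
        } else {
            /* Pause is too long: start a new span. */
            last++;
            spans[last] = spans[i];
        }
    }
    return last + 1;
}

/* Under the same assumption, the end event trails the raw end by the margin. */
uint32_t example_end_event_time_ms(struct example_span raw,
                                   uint32_t right_margin_ms)
{
    return raw.end_ms + right_margin_ms;
}
```

For illustrative raw spans of 0 to 1200, 1500 to 2600, and 3600 to 4000 ms, a right_margin of 160 leaves three separate segments, while a right_margin of 500 bridges the 300 ms pause and yields two segments (0 to 2600 and 3600 to 4000). The cost is that the end of the last segment is reported at 4500 ms rather than 4160 ms.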

Suggestion for adjusting min_speech_duration

A larger min_speech_duration requires a speech segment to last longer before it is accepted, which lowers the number of false alarms in noisy environments. A value between 0 and 300 ms is generally recommended. Experiments have shown that 200 ms has almost no effect on recognition performance while significantly reducing false alarms. At 300 ms, false alarms are greatly reduced, but overall recognition performance may also degrade to some extent. In practice, choose min_speech_duration to balance recognition performance against false alarms for your use case.