In speech interaction systems, voice activity detection (VAD) is often used as a front-end. However, traditional VAD algorithms usually wait until continuous tail silence reaches a preset maximum duration before segmenting, which introduces substantial latency and degrades the user experience. In this paper, we propose a novel semantic VAD for low-latency segmentation. Unlike existing methods, the semantic VAD adds a frame-level punctuation prediction task and includes an artificial endpoint class alongside the commonly used speech-presence and speech-absence classes. To strengthen the semantic information in the model, we also incorporate a semantic loss related to automatic speech recognition (ASR). Evaluations on an internal dataset show that, compared with the traditional VAD approach, the proposed method reduces average latency by 53.3% without significantly degrading the character error rate of the back-end ASR.
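The contrast between the two segmentation rules can be illustrated with a minimal sketch. It is not the authors' model: it assumes per-frame labels have already been produced by some classifier, and it covers only the decision logic. The label names, the function names, the `max_tail` parameter, and the tail-silence fallback in `semantic_cut` are all hypothetical choices made for this illustration.

```python
from typing import Optional, Sequence

# Hypothetical per-frame labels; the label set is an assumption for this sketch.
SPEECH, SILENCE, ENDPOINT = "speech", "silence", "endpoint"


def tail_silence_cut(labels: Sequence[str], max_tail: int) -> Optional[int]:
    """Traditional rule: segment once `max_tail` consecutive non-speech
    frames have followed speech. Returns the frame index of the cut."""
    tail, seen_speech = 0, False
    for i, label in enumerate(labels):
        if label == SPEECH:
            seen_speech, tail = True, 0
        elif seen_speech:
            tail += 1  # any non-speech frame counts as tail silence here
            if tail >= max_tail:
                return i
    return None


def semantic_cut(labels: Sequence[str], max_tail: int) -> Optional[int]:
    """Semantic rule (sketch): segment at the first endpoint frame after
    speech; fall back to the tail-silence rule (an assumption) otherwise."""
    tail, seen_speech = 0, False
    for i, label in enumerate(labels):
        if label == SPEECH:
            seen_speech, tail = True, 0
        elif seen_speech:
            if label == ENDPOINT:
                return i
            tail += 1
            if tail >= max_tail:
                return i
    return None


def latency_frames(labels: Sequence[str], cut: int) -> int:
    """Frames elapsed between the last speech frame and the cut."""
    last_speech = max(i for i, label in enumerate(labels[: cut + 1]) if label == SPEECH)
    return cut - last_speech


# Toy utterance: 5 speech frames, 2 silence, 1 endpoint, 10 silence.
labels = [SPEECH] * 5 + [SILENCE] * 2 + [ENDPOINT] + [SILENCE] * 10
traditional = tail_silence_cut(labels, max_tail=8)
semantic = semantic_cut(labels, max_tail=8)
print(latency_frames(labels, traditional))  # 8
print(latency_frames(labels, semantic))     # 3
```

In this toy sequence the endpoint label lets the segmenter cut after 3 frames rather than waiting for the full 8-frame tail. The numbers are illustrative only and bear no relation to the 53.3% figure reported above.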