In the context of keyword spotting (KWS), replacing handcrafted speech features with learnable features has not yielded superior KWS performance. In this study, we demonstrate that filterbank learning outperforms handcrafted speech features for KWS when the number of filterbank channels is severely reduced. Reducing the number of channels may cause some drop in KWS performance, but it also substantially reduces energy consumption, which is key when deploying always-on KWS on low-resource devices. Experimental results on a noisy version of the Google Speech Commands Dataset show that filterbank learning adapts to noise characteristics and thereby provides greater robustness to noise, especially when dropout is integrated. Thus, switching from the typically used 40-channel log-Mel features to 8-channel learned features leads to a relative KWS accuracy loss of only 3.5% while simultaneously reducing energy consumption by a factor of 6.3.
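To make the channel-count trade-off concrete, the following is a minimal sketch of how a filterbank maps one power-spectrum frame to K log-energy features, for K = 40 and K = 8. It is an illustration only and not the implementation used in this study: the function names, the 16 kHz sample rate, the 257 frequency bins, and the flat test frame are all assumptions made for the example. The weights here are fixed triangular Mel filters; in filterbank learning, a weight matrix of the same shape would instead be treated as trainable parameters and updated together with the KWS model.

```python
import math


def hz_to_mel(f):
    """Convert frequency in Hz to the Mel scale (common 2595*log10 formula)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(num_channels, num_bins, sample_rate):
    """Return triangular Mel filters as a num_channels x num_bins weight matrix.

    In a learned filterbank, a matrix of this shape would be a trainable
    parameter; here it is fixed (handcrafted).
    """
    nyquist = sample_rate / 2.0
    # num_channels + 2 edge frequencies, equally spaced on the Mel scale.
    top_mel = hz_to_mel(nyquist)
    hz_edges = [mel_to_hz(top_mel * i / (num_channels + 1))
                for i in range(num_channels + 2)]
    # Centre frequency of each spectrum bin, from 0 Hz up to Nyquist.
    bin_hz = [nyquist * b / (num_bins - 1) for b in range(num_bins)]

    weights = []
    for k in range(num_channels):
        lo, mid, hi = hz_edges[k], hz_edges[k + 1], hz_edges[k + 2]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        weights.append(row)
    return weights


def log_filterbank_features(power_spectrum, weights, eps=1e-6):
    """Apply the filterbank to one power-spectrum frame and take the log."""
    return [
        math.log(max(sum(w * p for w, p in zip(row, power_spectrum)), eps))
        for row in weights
    ]


# Assumed front-end settings for illustration: 16 kHz audio, 257 bins.
NUM_BINS = 257
SAMPLE_RATE = 16000
frame = [1.0] * NUM_BINS  # hypothetical flat power spectrum, not real data

features_40 = log_filterbank_features(frame, mel_filterbank(40, NUM_BINS, SAMPLE_RATE))
features_8 = log_filterbank_features(frame, mel_filterbank(8, NUM_BINS, SAMPLE_RATE))
print(len(features_40), len(features_8))
```

With 8 channels, each frame yields one fifth as many features as with 40, so the front-end and the first layer of the acoustic model have correspondingly less work to do per frame. This sketch does not model energy consumption; the 6.3x figure above comes from the study's own measurements.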