Keyword spotting is an important research field because it plays a key role in device wake-up and user interaction on smart devices. However, it is challenging to minimize errors while operating efficiently on devices with limited resources, such as mobile phones. We present a broadcasted residual learning method that achieves high accuracy with a small model size and computational load. Our method configures most of the residual functions as 1D temporal convolutions while still allowing 2D convolution through a broadcasted-residual connection that expands the temporal output to the frequency-temporal dimension. This residual mapping enables the network to represent useful audio features effectively with much less computation than conventional convolutional neural networks. We also propose a novel network architecture, the Broadcasting-residual network (BC-ResNet), based on broadcasted residual learning, and describe how to scale up the model according to the target device's resources. BC-ResNets achieve state-of-the-art top-1 accuracies of 98.0% and 98.7% on the Google Speech Commands datasets v1 and v2, respectively, and consistently outperform previous approaches while using fewer computations and parameters. Code is available at https://github.com/Qualcomm-AI-research/bcresnet.
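To make the broadcasted-residual connection concrete, the following is a minimal NumPy sketch, not the released implementation. It assumes one plausible reading of the description above: a frequency-side function `f2` feeds a frequency-averaged 1D temporal function `f1`, and the `(C, 1, T)` temporal output is broadcast back across the frequency axis when it is added to the 2D features. The helper names, the kernel size of 3, the ReLU activation, and the omission of normalization layers are all illustrative assumptions.

```python
import numpy as np


def depthwise_conv_along(x, kernels, axis):
    """Per-channel 1D convolution (zero "same" padding) along one axis.

    x: array of shape (C, F, T); kernels: array of shape (C, K) with odd K.
    """
    c, k = kernels.shape
    pad = k // 2
    pad_width = [(0, 0)] * x.ndim
    pad_width[axis] = (pad, pad)
    xp = np.pad(x, pad_width)
    n = x.shape[axis]
    out = np.zeros_like(x)
    for i in range(k):
        # Shifted window of the padded input, weighted per channel.
        window = [slice(None)] * x.ndim
        window[axis] = slice(i, i + n)
        out += kernels[:, i].reshape(c, 1, 1) * xp[tuple(window)]
    return out


def broadcasted_residual_block(x, w_freq, w_time, w_point):
    """Hypothetical block: y = x + f2(x) + broadcast(f1(avgpool_F(f2(x))))."""
    # f2: the 2D side, here a depthwise convolution along frequency.
    f2 = depthwise_conv_along(x, w_freq, axis=1)        # (C, F, T)
    # Average over frequency so the remaining work is purely temporal.
    pooled = f2.mean(axis=1, keepdims=True)             # (C, 1, T)
    # f1: depthwise temporal conv, ReLU (assumed), pointwise channel mixing.
    f1 = depthwise_conv_along(pooled, w_time, axis=2)   # (C, 1, T)
    f1 = np.maximum(f1, 0.0)
    f1 = np.einsum("oc,cft->oft", w_point, f1)          # (C, 1, T)
    # Broadcasted residual: (C, 1, T) expands across F in the addition.
    return x + f2 + f1


rng = np.random.default_rng(0)
C, F, T = 4, 8, 10  # channels, frequency bins, time frames (arbitrary)
x = rng.standard_normal((C, F, T))
y = broadcasted_residual_block(
    x,
    rng.standard_normal((C, 3)),
    rng.standard_normal((C, 3)),
    rng.standard_normal((C, C)),
)
print(y.shape)  # (4, 8, 10)
```

In this sketch the temporal branch operates on `T` positions per channel rather than `F * T`, which is where the computational saving described above would come from; the broadcast in the final addition restores the frequency-temporal shape at no extra multiply cost.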