Voice-based interfaces rely on a wake-up word mechanism to initiate communication with devices. However, achieving robust, energy-efficient, and fast detection remains a challenge. This paper addresses these production needs by augmenting the training data with temporal alignments and by using a two-phase, multi-resolution detection scheme. The scheme employs two models: a lightweight on-device model that processes the audio stream in real time, and a server-side verification model, an ensemble of heterogeneous architectures that refines the on-device decision. Splitting detection in this way allows two operating points to be optimized, one per phase. To protect privacy, audio features rather than raw audio are sent to the cloud. The study investigated different parametric configurations for feature extraction in order to select one for on-device detection and another for the verification model. Furthermore, thirteen different audio classifiers were compared in terms of performance and inference time. The proposed ensemble outperforms our strongest single classifier in every noise condition.
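The two-phase scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `TwoStageDetector`, the threshold values, and the choice to combine the ensemble by averaging scores are all assumptions made for illustration, since the abstract does not say how the heterogeneous models are combined. The two feature arguments stand in for the two feature-extraction configurations (one per phase).

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Features = Sequence[float]
Scorer = Callable[[Features], float]  # maps features to a wake-word score in [0, 1]


@dataclass
class TwoStageDetector:
    """Hypothetical two-phase wake-word cascade (illustrative only)."""

    on_device: Scorer                # lightweight model run on the audio stream
    verifiers: Sequence[Scorer]      # heterogeneous server-side models
    device_threshold: float = 0.3    # assumed operating point for phase 1
    server_threshold: float = 0.7    # assumed operating point for phase 2

    def __post_init__(self) -> None:
        if not self.verifiers:
            raise ValueError("at least one verifier is required")

    def detect(self, device_features: Features, server_features: Features) -> bool:
        # Phase 1: cheap on-device check; most audio is rejected here,
        # so nothing is sent to the server.
        if self.on_device(device_features) < self.device_threshold:
            return False
        # Phase 2: only features (not raw audio) reach the server, where the
        # ensemble score is taken as the mean of the verifier scores
        # (averaging is an assumption, not stated in the abstract).
        scores = [verify(server_features) for verify in self.verifiers]
        return sum(scores) / len(scores) >= self.server_threshold
```

Because each phase has its own threshold, the on-device threshold can be set permissively to keep misses low, while the server threshold can be set strictly to suppress false accepts.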