Low-latency models are critical for real-time speech enhancement applications such as hearing aids and hearables. However, the sub-millisecond latency regime for resource-constrained hearables remains underexplored. We demonstrate speech enhancement using a computationally efficient minimum-phase FIR filter, which enables sample-by-sample processing and achieves a mean algorithmic latency of 0.32 ms to 1.25 ms. With a single microphone, we observe a mean SI-SDRi of 4.1 dB. The approach generalizes to unseen audio recordings, on which DNSMOS increases by 0.2. A lightweight LSTM-based model with 644k parameters generates the FIR taps. Our benchmarks show that the system can run on a low-power DSP at 388 MIPS with a mean end-to-end latency of 3.35 ms. We also compare against baseline low-latency spectral masking techniques. We hope this work will enable a better understanding of latency and can be used to improve the comfort and usability of hearables.
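To make the idea of sample-by-sample FIR processing concrete, the following is a minimal sketch, not the authors' implementation. It filters a signal one sample at a time while allowing the taps to change at block boundaries. The function name `fir_sample_by_sample`, the `taps_per_block` array, and the `block_len` parameter are hypothetical; in the described system the taps would come from the LSTM-based model, which is not reproduced here.

```python
import numpy as np

def fir_sample_by_sample(x, taps_per_block, block_len):
    """Filter x one sample at a time with taps that may change per block.

    x              : 1-D input signal
    taps_per_block : array of shape (n_blocks, n_taps); hypothetical stand-in
                     for taps produced by a model
    block_len      : number of samples for which one set of taps is held
    """
    taps_per_block = np.asarray(taps_per_block, dtype=float)
    n_taps = taps_per_block.shape[1]
    history = np.zeros(n_taps)  # history[0] holds the newest sample
    y = np.empty(len(x))
    for n, sample in enumerate(x):
        # Shift the delay line by one and store the new sample at the front.
        history = np.roll(history, 1)
        history[0] = sample
        # Select the taps for the current block (reuse the last set if we run out).
        block = min(n // block_len, len(taps_per_block) - 1)
        # Each output depends only on the current and past samples.
        y[n] = np.dot(taps_per_block[block], history)
    return y
```

Because every output sample depends only on the current and past inputs, the filter itself does not need to wait for a frame of future samples. With constant taps, the result matches the first `len(x)` samples of `np.convolve(x, h)`.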