We present DPDFNet, a causal single-channel speech enhancement model that extends the DeepFilterNet2 architecture with dual-path blocks in the encoder, strengthening long-range temporal and cross-band modeling while preserving the original enhancement framework. We further show that adding a loss component that mitigates over-attenuation of the enhanced speech, combined with a fine-tuning phase tailored to "always-on" applications, substantially improves overall model performance. To compare the proposed architecture with a range of causal open-source models, we created a new evaluation set of long, low-SNR recordings in 12 languages across everyday noise scenarios, which reflects real-world conditions better than commonly used benchmarks. On this evaluation set, DPDFNet outperforms the other causal open-source models, including some that are substantially larger and more computationally demanding. We also propose PRISM, a holistic metric that aggregates scale-normalized intrusive and non-intrusive metrics into a single composite score; under PRISM, performance scales clearly with the number of dual-path blocks. Finally, we demonstrate on-device feasibility by deploying DPDFNet on Ceva-NeuPro-Nano edge NPUs. DPDFNet-4, our second-largest model, achieves real-time performance on NPN32 and runs even faster on NPN64, confirming that state-of-the-art quality can be sustained within strict embedded power and latency constraints.
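To make the idea of a composite, scale-normalized aggregate concrete, the following is a minimal sketch of one way such a score could be formed. It does not reproduce the PRISM definition, which the abstract does not specify: the choice of metrics (`pesq`, `stoi`, `dnsmos`), their assumed ranges, the min-max normalization, and the unweighted mean are all illustrative assumptions, and the names `METRIC_RANGES`, `normalize`, and `composite_score` are hypothetical.

```python
from statistics import mean

# Assumed nominal ranges for each metric (illustrative only, not from the paper).
METRIC_RANGES = {
    "pesq": (1.0, 4.5),    # intrusive: needs a clean reference signal
    "stoi": (0.0, 1.0),    # intrusive: needs a clean reference signal
    "dnsmos": (1.0, 5.0),  # non-intrusive: computed on the enhanced signal alone
}


def normalize(value, lo, hi):
    """Min-max scale a raw metric value to [0, 1], clipping out-of-range inputs."""
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))


def composite_score(scores, ranges=METRIC_RANGES):
    """Unweighted mean of the normalized metrics (a hypothetical aggregation rule)."""
    return mean(normalize(scores[name], *ranges[name]) for name in ranges)


if __name__ == "__main__":
    # Made-up example values, used only to exercise the function.
    example = {"pesq": 2.75, "stoi": 0.5, "dnsmos": 3.0}
    print(composite_score(example))
```

Normalizing each metric to a common scale before averaging keeps a metric with a wide numeric range from dominating the aggregate, which is the general motivation for scale normalization in a composite score.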