Supervised speech enhancement has benefited significantly from recent advances in neural networks, especially their ability to non-linearly fit diverse representations of the target speech, such as the waveform or spectrum. However, these direct-fitting solutions still suffer from degraded speech and residual noise in hearing evaluations. In this letter, we bridge speech enhancement and the Information Bottleneck principle, rethink a universal plug-and-play strategy, and propose a Refining Underlying Information framework, called RUI, to address these challenges in both theory and practice. Specifically, we first transform the objective of speech enhancement into an incremental convergence problem of the mutual information between comprehensive speech characteristics and individual speech characteristics, e.g., spectral and acoustic characteristics. Under this formulation, in contrast to existing direct-fitting solutions, the underlying information stems from the conditional entropy of the acoustic characteristics given the spectral characteristics. We therefore design a dual-path multiple refinement iterator, based on the chain rule of entropy, that refines this underlying information to further approximate the target speech. Experimental results on the DNS-Challenge dataset show that our solution consistently improves the PESQ score by more than 0.3 over the baselines, with only 1.18 M additional parameters. The source code is available at https://github.com/caoruitju/RUI_SE.
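As a brief illustration of the information-theoretic identities the abstract refers to, the standard chain rules can be written as follows. The symbols are assumptions introduced here for illustration and are not taken from the letter: $Y$ stands for the comprehensive speech characteristics, $X_s$ for the spectral characteristics, and $X_a$ for the acoustic characteristics. This is a sketch of one plausible reading, not the authors' exact formulation.

```latex
% Chain rule of entropy for two characteristics (assumed notation):
\begin{align}
  H(X_s, X_a) &= H(X_s) + H(X_a \mid X_s), \\
% Chain rule of mutual information with the comprehensive characteristics Y:
  I(Y; X_s, X_a) &= I(Y; X_s) + I(Y; X_a \mid X_s), \\
% General form for N individual characteristics X_1, ..., X_N:
  H(X_1, \dots, X_N) &= \sum_{i=1}^{N} H(X_i \mid X_1, \dots, X_{i-1}).
\end{align}
```

Under this reading, a model that fits only the spectral representation accounts for the $H(X_s)$ term, while the conditional term $H(X_a \mid X_s)$ corresponds to the underlying information that the refinement iterator is designed to recover step by step.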