Dual-View Predictive Diffusion: Lightweight Speech Enhancement via Spectrogram-Image Synergy

Diffusion models have recently set new benchmarks in Speech Enhancement (SE). However, most existing score-based models treat speech spectrograms as generic 2D images and process them uniformly, ignoring the intrinsic structural sparsity of audio. The result is inefficient spectral representation and prohibitive computational complexity. To bridge this gap, we propose DVPD, an extremely lightweight Dual-View Predictive Diffusion model that exploits the dual nature of spectrograms, as both visual textures and physical frequency-domain representations, during both training and inference. Specifically, during training, we optimize spectral utilization with the Frequency-Adaptive Non-uniform Compression (FANC) encoder, which preserves critical low-frequency harmonics while pruning high-frequency redundancies. In parallel, we introduce a Lightweight Image-based Spectro-Awareness (LISA) module that captures features from a visual perspective with minimal overhead. During inference, we propose a Training-free Lossless Boost (TLB) strategy that leverages the same dual-view priors to refine generation quality without any additional fine-tuning. Extensive experiments across various benchmarks demonstrate that DVPD achieves state-of-the-art performance while requiring only 35% of the parameters and 40% of the inference MACs of PGUSE, the state-of-the-art lightweight model. These results highlight DVPD's ability to balance high-fidelity speech quality with extreme architectural efficiency. Code and audio samples are available at the anonymous website: https://anonymous.4open.science/r/dvpd_demo-E630
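To make the idea of non-uniform frequency compression concrete, the following is a minimal sketch and not the FANC encoder itself, whose design is not specified in this abstract. It assumes a fixed, hand-written rule: the lowest `keep` frequency bins of each frame are copied unchanged, and the remaining bins are averaged in groups of `factor`. The function names and both parameters are hypothetical, introduced only for illustration.

```python
from typing import List


def nonuniform_compress(frame: List[float], keep: int, factor: int) -> List[float]:
    """Compress one spectrogram frame along the frequency axis.

    Illustrative assumption, not the paper's method: the first `keep` bins
    (low frequencies) are copied unchanged, and the remaining bins are
    averaged in groups of `factor`, so the high band is stored at a
    coarser resolution.
    """
    if keep < 0 or factor < 1:
        raise ValueError("keep must be >= 0 and factor must be >= 1")
    low = list(frame[:keep])          # low band kept at full resolution
    high = frame[keep:]               # high band to be pooled
    pooled = []
    for start in range(0, len(high), factor):
        group = high[start:start + factor]
        pooled.append(sum(group) / len(group))  # mean over each group
    return low + pooled


def compress_spectrogram(spec: List[List[float]], keep: int, factor: int) -> List[List[float]]:
    """Apply the same per-frame compression to every frame (time step)."""
    return [nonuniform_compress(frame, keep, factor) for frame in spec]


if __name__ == "__main__":
    frame = [1.0, 2.0, 3.0, 4.0, 4.0, 6.0, 8.0, 10.0]
    # Keep 4 low bins, average the 4 high bins in pairs.
    print(nonuniform_compress(frame, keep=4, factor=2))
    # -> [1.0, 2.0, 3.0, 4.0, 5.0, 9.0]
```

With these hypothetical settings, a 257-bin frame with `keep=65` and `factor=4` shrinks to 113 values. A frequency-adaptive encoder would presumably learn or adapt this allocation rather than fix it by hand.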
