Speech enhancement models typically apply uniform capacity across all frequencies, disregarding the non-uniform spectral resolution of human hearing. We propose BASENet, a frequency-adapted architecture that partitions the spectrum into Bark-scale bands and assigns each band an encoder whose capacity is scaled according to critical-band density, automatically granting deeper branches to perceptually dense low frequencies and lighter ones to high frequencies. A cross-band attention module captures harmonic dependencies across bands through compact frequency-pooled representations at linear complexity. Built on inverted residual blocks with dense connectivity and a convolutional recurrent network, BASENet achieves a PESQ of 3.55 and a STOI of 96% on VoiceBank+DEMAND with only 0.83M parameters and 7.3 GMACs, the fewest parameters among all methods with PESQ > 3.50. A causal variant (3.44 PESQ) surpasses several non-causal baselines, confirming suitability for real-time streaming on resource-constrained devices.
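To make the band-partitioning idea concrete, the following is a minimal sketch of how STFT bins could be split into bands of equal Bark width, with a per-band depth derived from how many Bark each bin covers. It is not BASENet's actual procedure: the abstract does not give the Bark formula, the number of bands, the STFT size, or the depth rule. The Zwicker-Terhardt Bark approximation, the defaults (512-point FFT, 16 kHz, 4 bands), and the function names `hz_to_bark`, `bark_band_edges`, and `relative_depths` are all assumptions made for illustration.

```python
import numpy as np


def hz_to_bark(f_hz):
    """Zwicker-Terhardt approximation of the Bark scale (assumed here)."""
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)


def bark_band_edges(n_fft=512, sample_rate=16000, n_bands=4):
    """Split the STFT bins into n_bands bands of (roughly) equal Bark width.

    Returns n_bands + 1 bin indices; band i covers bins edges[i]:edges[i + 1].
    All defaults are illustrative, not values taken from the paper.
    """
    n_bins = n_fft // 2 + 1
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    bark = hz_to_bark(freqs)  # monotonically increasing in frequency
    # Equally spaced targets on the Bark axis, mapped back to bin indices.
    targets = np.linspace(bark[0], bark[-1], n_bands + 1)
    inner = np.searchsorted(bark, targets[1:-1], side="left")
    return np.concatenate(([0], inner, [n_bins])).astype(int)


def relative_depths(edges, max_depth=4):
    """Hypothetical depth rule: deeper encoders where Bark-per-bin is higher.

    Every band spans the same Bark range, so Bark-per-bin is proportional
    to 1 / (number of bins in the band).
    """
    widths = np.diff(edges)
    density = 1.0 / widths
    scaled = density / density.max()
    return np.maximum(1, np.round(max_depth * scaled)).astype(int)


if __name__ == "__main__":
    edges = bark_band_edges()
    print("band edges (bin indices):", edges)
    print("encoder depth per band:", relative_depths(edges))
```

Under these assumptions the low-frequency bands contain few bins but receive the largest depth, while the wide high-frequency bands receive the smallest, which mirrors the allocation the abstract describes.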