Speech Enhancement (SE) in audio devices is often supported by auxiliary modules for Voice Activity Detection (VAD), SNR estimation, or Acoustic Scene Classification, which ensure robust, context-aware behavior and a seamless user experience. Like SE, these tasks often rely on deep learning; however, deploying additional models on-device is computationally impractical, whereas cloud-based inference would introduce additional latency and compromise privacy. Prior work on SE employed Dynamic Channel Pruning (DynCP) to reduce computation by adaptively disabling specific channels based on the current input. In this work, we investigate whether useful signal properties can be estimated from these internal pruning masks, thus removing the need for separate models. We show that simple, interpretable predictors achieve up to 93% accuracy on VAD, 84% accuracy on noise classification, and an R² of 0.86 on F0 estimation. With binary masks, predictions reduce to weighted sums, incurring negligible overhead. Our contribution is twofold: first, we examine the emergent behavior of DynCP models through the lens of downstream prediction tasks, revealing what they learn; second, we repurpose DynCP as a holistic solution for efficient SE and simultaneous estimation of signal properties.
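To illustrate why a predictor on binary masks reduces to a weighted sum, the following is a minimal sketch assuming a linear predictor over a single frame's pruning mask. The function names, weights, bias, and threshold are hypothetical and chosen for illustration only; they are not taken from this work.

```python
from typing import Sequence


def predict_from_mask(mask: Sequence[int],
                      weights: Sequence[float],
                      bias: float = 0.0) -> float:
    """Linear predictor on a binary pruning mask (illustrative sketch).

    Because each mask entry is 0 or 1, the dot product between mask and
    weights needs no multiplications: we only add the weights of the
    channels that are currently active.
    """
    if len(mask) != len(weights):
        raise ValueError("mask and weights must have the same length")
    return bias + sum(w for m, w in zip(mask, weights) if m)


def vad_decision(mask: Sequence[int],
                 weights: Sequence[float],
                 bias: float = 0.0,
                 threshold: float = 0.0) -> bool:
    """Hypothetical binary decision: speech if the score exceeds a threshold."""
    return predict_from_mask(mask, weights, bias) > threshold


# Hypothetical values for one frame with five prunable channels.
mask = [1, 0, 1, 1, 0]                    # 1 = channel kept, 0 = channel pruned
weights = [0.5, -1.0, 0.25, -0.25, 2.0]   # assumed learned weights
score = predict_from_mask(mask, weights, bias=-0.25)
is_speech = vad_decision(mask, weights, bias=-0.25)
```

In this sketch the per-frame cost is at most one addition per active channel, which is consistent with the claim that the overhead is negligible relative to the SE model itself.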