The ability to generalize to a wide range of recording devices is a crucial performance factor for audio classification models. The characteristics of different types of microphones introduce distributional shifts in the digitized audio signals due to their varying frequency responses. If this domain shift is not taken into account during training, the model's performance could degrade severely when it is applied to signals recorded by unseen devices. In particular, training a model on audio signals recorded with a small number of different microphones can make generalization to unseen devices difficult. To tackle this problem, we convolve audio signals in the training set with pre-recorded device impulse responses (DIRs) to artificially increase the diversity of recording devices. We systematically study the effect of DIR augmentation on the task of Acoustic Scene Classification using CNNs and Audio Spectrogram Transformers. The results show that DIR augmentation in isolation performs similarly to the state-of-the-art method Freq-MixStyle. However, we also show that DIR augmentation and Freq-MixStyle are complementary, achieving a new state-of-the-art performance on signals recorded by devices unseen during training.
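The core operation described above, convolving a training waveform with a device impulse response, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `apply_dir` and `dir_augment`, the application probability `p`, the truncation to the original length, and the peak rescaling are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def apply_dir(waveform, dir_ir):
    """Convolve a mono waveform with one device impulse response (DIR).

    The result is truncated to the input length and rescaled to the input's
    peak amplitude (an assumption made here so that the augmentation changes
    spectral colouring rather than loudness).
    """
    waveform = np.asarray(waveform, dtype=np.float64)
    dir_ir = np.asarray(dir_ir, dtype=np.float64)
    out = np.convolve(waveform, dir_ir, mode="full")[: len(waveform)]
    peak = np.max(np.abs(out))
    if peak > 0:
        out = out * (np.max(np.abs(waveform)) / peak)
    return out


def dir_augment(waveform, dirs, p=0.5, rng=None):
    """With probability p, filter the waveform with a randomly chosen DIR.

    `dirs` is a list of 1-D impulse responses; `p` is a hypothetical
    hyperparameter, not a value taken from the paper.
    """
    if rng is None:
        rng = np.random.default_rng()
    if len(dirs) == 0 or rng.random() >= p:
        return np.asarray(waveform, dtype=np.float64)
    idx = int(rng.integers(len(dirs)))
    return apply_dir(waveform, dirs[idx])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.standard_normal(16000)           # 1 s of synthetic audio at 16 kHz
    fake_dirs = [rng.standard_normal(256) * np.exp(-np.arange(256) / 32.0)
                 for _ in range(4)]              # synthetic stand-ins for real DIRs
    augmented = dir_augment(audio, fake_dirs, p=1.0, rng=rng)
    print(augmented.shape)
```

In a training pipeline, a function like `dir_augment` would typically be applied per example before spectrogram extraction; for long impulse responses an FFT-based convolution would be faster than `np.convolve`.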