The ability to generalize to a wide range of recording devices is crucial for audio classification models. Different types of microphones have different frequency responses, which introduce distributional shifts in the digitized audio signals. If this domain shift is not taken into account during training, the model's performance can degrade severely when it is applied to signals recorded by unseen devices. In particular, training a model on audio signals recorded with only a small number of different microphones can make generalization to unseen devices difficult. To tackle this problem, we convolve audio signals in the training set with pre-recorded device impulse responses (DIRs) to artificially increase the diversity of recording devices. We systematically study the effect of DIR augmentation on the task of Acoustic Scene Classification using CNNs and Audio Spectrogram Transformers. The results show that DIR augmentation in isolation performs similarly to the state-of-the-art method Freq-MixStyle. We also show that DIR augmentation and Freq-MixStyle are complementary: combining them achieves new state-of-the-art performance on signals recorded by devices unseen during training.
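The core operation, convolving a training waveform with a device impulse response, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `augment_with_random_dir`, the application probability `p`, the truncation to the input length, and the peak rescaling are all hypothetical choices made here for the example.

```python
import numpy as np


def augment_with_random_dir(waveform, dirs, p=1.0, rng=None):
    """Convolve a mono waveform with one randomly chosen device impulse response.

    waveform: 1-D float array holding the audio samples.
    dirs:     list of 1-D float arrays, each a pre-recorded impulse response.
    p:        probability of applying the augmentation (assumed parameter).
    rng:      optional numpy.random.Generator for reproducibility.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Skip the augmentation with probability 1 - p.
    if rng.random() >= p:
        return waveform.copy()

    # Pick one impulse response uniformly at random.
    impulse_response = dirs[rng.integers(len(dirs))]

    # Linear convolution, truncated so the clip keeps its original length.
    out = np.convolve(waveform, impulse_response, mode="full")[: len(waveform)]

    # Assumption: rescale so the output peak matches the input peak,
    # which keeps the augmented clip in the same amplitude range.
    peak_in = np.max(np.abs(waveform))
    peak_out = np.max(np.abs(out))
    if peak_out > 0:
        out = out * (peak_in / peak_out)
    return out
```

In a training pipeline, a function like this would typically be applied to the raw waveform before the spectrogram is computed, so that the simulated frequency response of the device shapes the features seen by the model.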