The audio spectrogram is a time-frequency representation that has been widely used for audio classification. One of its key attributes is the temporal resolution, which depends on the hop size used in the Short-Time Fourier Transform (STFT). Previous work generally assumes a constant hop size (e.g., 10 ms). However, a fixed temporal resolution is not optimal for every type of sound. The temporal resolution affects not only classification accuracy but also computational cost. This paper proposes a novel method, DiffRes, that enables differentiable temporal resolution modeling for audio classification. Given a spectrogram calculated with a fixed hop size, DiffRes merges non-essential time frames while preserving important frames. DiffRes acts as a "drop-in" module between an audio spectrogram and a classifier and can be jointly optimized with the classification task. We evaluate DiffRes on five audio classification tasks, using mel-spectrograms as the acoustic features, followed by off-the-shelf classifier backbones. Compared with previous methods that use a fixed temporal resolution, the DiffRes-based method achieves equivalent or better classification accuracy while reducing computational cost by at least 25%. We further show that DiffRes can improve classification accuracy by increasing the temporal resolution of the input acoustic features, without adding to the computational cost.
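To make the link between hop size, temporal resolution, and frame merging concrete, the following is a minimal NumPy sketch. It is not the DiffRes algorithm, whose details are not given in this abstract: the helper names `num_frames` and `merge_frames` are hypothetical, the learned, differentiable frame importance is replaced by a fixed per-frame energy heuristic, and frames are merged in contiguous groups purely for illustration. The 16 kHz sample rate and 25 ms window are likewise assumptions.

```python
import numpy as np


def num_frames(n_samples: int, win_length: int, hop_length: int) -> int:
    """Number of STFT frames for an unpadded signal (illustrative helper)."""
    if n_samples < win_length:
        return 0
    return 1 + (n_samples - win_length) // hop_length


def merge_frames(spec: np.ndarray, importance: np.ndarray,
                 target_frames: int) -> np.ndarray:
    """Compress a (n_bins, n_frames) spectrogram to (n_bins, target_frames).

    Consecutive frames are split into `target_frames` contiguous groups and
    each group is replaced by its importance-weighted average. This is a toy
    stand-in for learned merging: `importance` must be strictly positive.
    """
    n_bins, n_frames = spec.shape
    if not 1 <= target_frames <= n_frames:
        raise ValueError("target_frames must be in [1, n_frames]")
    groups = np.array_split(np.arange(n_frames), target_frames)
    out = np.empty((n_bins, target_frames), dtype=float)
    for j, idx in enumerate(groups):
        weights = importance[idx] / importance[idx].sum()
        out[:, j] = spec[:, idx] @ weights
    return out


# Assumed setup: 1 s of audio at 16 kHz, 25 ms window (400 samples).
n_samples, win_length = 16000, 400
frames_10ms = num_frames(n_samples, win_length, hop_length=160)  # 10 ms hop
frames_5ms = num_frames(n_samples, win_length, hop_length=80)    # 5 ms hop

# Random stand-in for a 64-bin mel-spectrogram computed with a 10 ms hop.
rng = np.random.default_rng(0)
spec = rng.random((64, frames_10ms))

# Heuristic importance: per-frame energy (an assumption, not a learned score).
importance = spec.sum(axis=0) + 1e-8

# Keep roughly 75% of the frames.
compressed = merge_frames(spec, importance, int(frames_10ms * 0.75))
```

Halving the hop size roughly doubles the number of frames the classifier must process, which is why temporal resolution drives computational cost. Merging frames after the spectrogram is computed shortens the classifier input without changing the hop size itself.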