General-purpose embeddings are highly desirable for few-shot and even zero-shot learning in many application scenarios, including audio tasks. To better understand learned representations, we conducted a thorough error analysis and visualization of the HEAR 2021 submission results. Inspired by this analysis, this work experiments with different front-end audio preprocessing methods, including the Constant-Q Transform (CQT) and the Short-Time Fourier Transform (STFT), and proposes a Batch Embedding Covariance Regularization (BECR) term, with the aim of more holistically simulating the frequency information received by the human auditory system. We tested the models on the HEAR 2021 suite, which covers a broad range of tasks. Preliminary results show that (1) the proposed BECR yields a more dispersed embedding on the test set, (2) BECR improves the PaSST model without extra computational complexity, and (3) STFT preprocessing outperforms CQT on all tasks we tested. GitHub: https://github.com/ankitshah009/general_audio_embedding_hear_2021