Attention-based Transformers have been increasingly applied to audio classification because of their global receptive field and their ability to handle long-term dependencies. However, existing frameworks, which are mainly extended from Vision Transformers, are not fully compatible with audio signals. In this paper, we introduce the Causal Audio Transformer (CAT), which combines Multi-Resolution Multi-Feature (MRMF) feature extraction with an acoustic attention block for audio modeling better suited to such signals. In addition, we propose a causal module that alleviates over-fitting, helps with knowledge transfer, and improves interpretability. CAT achieves classification performance higher than or comparable to the state of the art on the ESC50, AudioSet, and UrbanSound8K datasets, and its components can be easily generalized to other Transformer-based models.