Transformers have achieved promising results on a variety of tasks. However, the quadratic complexity of self-attention computation has limited their application, especially in low-resource settings and on mobile or edge devices. Existing works have proposed hand-crafted attention patterns to reduce computational complexity. However, such hand-crafted patterns are data-agnostic and may not be optimal, so relevant keys or values are likely to be discarded while less important ones are preserved. Based on this insight, we propose a novel deformable audio Transformer for audio recognition, named DATAR, in which a learnable deformable attention mechanism is combined with a pyramid transformer backbone. Such an architecture has proven effective in prediction tasks, e.g., event classification. Moreover, we identify that the deformable attention map computation may over-simplify the input feature, which leaves room for improvement. Hence, we introduce a learnable input adaptor to alleviate this issue, and DATAR achieves state-of-the-art performance.