Transformers have achieved promising results on a variety of tasks. However, the quadratic complexity of self-attention limits their applicability, especially in low-resource settings and on mobile or edge devices. Existing works exploit hand-crafted attention patterns to reduce computational complexity, but such patterns are data-agnostic and may not be optimal: relevant keys or values may be discarded while less important ones are preserved. Based on this insight, we propose a novel deformable audio Transformer for audio recognition, named DATAR, in which a learnable deformable attention mechanism is combined with a pyramid Transformer backbone. Such an architecture has proven effective in prediction tasks,~\textit{e.g.}, event classification. Moreover, we identify that the deformable attention map computation may over-simplify the input feature, leaving room for enhancement. We therefore introduce a learnable input adaptor to alleviate this issue, with which DATAR achieves state-of-the-art performance.