We explore various attention methods on the frequency and channel dimensions for sound event detection (SED), aiming to enhance performance with a minimal increase in computational cost while leveraging domain knowledge about the frequency dimension of audio data. In previous work, we introduced frequency dynamic convolution (FDY conv) to relax the translational equivariance that 2D convolution imposes along the frequency dimension of 2D audio data. Although this approach demonstrated state-of-the-art SED performance, it resulted in a model with 150% more trainable parameters. To achieve comparable SED performance with more computationally efficient, and therefore more practical, methods, we explore lighter alternative attention methods, focusing on those applied to the frequency and channel dimensions. Jointly applying the squeeze-and-excitation (SE) module and time-frame frequency-wise SE (tfwSE), so that attention covers both the channel and frequency dimensions, yields performance comparable to the SED model with FDY conv while adding only 2.7% more trainable parameters than the baseline model. In addition, we perform a class-wise comparison of the attention methods to further discuss their characteristics.
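As a rough illustration of what attention on the channel and frequency dimensions can look like, the following NumPy sketch applies an SE-style gate over channels and then an SE-style gate over frequency bins computed separately for each time frame. It is a minimal sketch under stated assumptions, not the implementation evaluated in this work: the pooling axes used for the tfwSE-like gate, the order of the two gates, the reduction ratio `r`, the omission of biases, and all function and variable names are illustrative choices that the abstract does not specify.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def se_gate(z, w1, w2):
    """Bottleneck gate: linear -> ReLU -> linear -> sigmoid over the last axis.

    z: (..., d), w1: (d, d // r), w2: (d // r, d). Biases are omitted for brevity.
    """
    return sigmoid(np.maximum(z @ w1, 0.0) @ w2)


def channel_se(x, w1, w2):
    """SE over channels. x has shape (C, T, F)."""
    z = x.mean(axis=(1, 2))          # squeeze time and frequency -> (C,)
    g = se_gate(z, w1, w2)           # one weight per channel -> (C,)
    return x * g[:, None, None]


def tfw_se(x, w1, w2):
    """Frequency-wise SE per time frame (assumed form). x has shape (C, T, F)."""
    z = x.mean(axis=0)               # squeeze channels only -> (T, F)
    g = se_gate(z, w1, w2)           # one weight per frame and frequency bin -> (T, F)
    return x * g[None, :, :]


# Hypothetical feature map sizes: 8 channels, 10 time frames, 16 frequency bins.
C, T, F, r = 8, 10, 16, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((C, T, F))

# Randomly initialised gate weights (these would be learned in a real model).
wc1 = 0.1 * rng.standard_normal((C, C // r))
wc2 = 0.1 * rng.standard_normal((C // r, C))
wf1 = 0.1 * rng.standard_normal((F, F // r))
wf2 = 0.1 * rng.standard_normal((F // r, F))

# Joint application: channel gate first, then the per-frame frequency gate.
y = tfw_se(channel_se(x, wc1, wc2), wf1, wf2)

# Parameters added by the two gates in this toy setting.
extra_params = wc1.size + wc2.size + wf1.size + wf2.size
print(y.shape, extra_params)
```

Each gate adds only two small weight matrices whose size depends on the channel or frequency dimension and the reduction ratio, which is the general reason SE-style modules add few parameters relative to the convolution layers they are attached to. The parameter count printed here applies only to this toy configuration and says nothing about the percentages reported above.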