Temporal Attention Pooling for Frequency Dynamic Convolution in Sound Event Detection

Recent advances in deep learning, particularly frequency dynamic convolution (FDY conv), have significantly improved sound event detection (SED) by enabling frequency-adaptive feature extraction. However, FDY conv relies on temporal average pooling, which weights all temporal frames equally and thus limits its ability to capture transient sound events such as alarm bells, door knocks, and speech plosives. To address this limitation, we propose temporal attention pooling frequency dynamic convolution (TFD conv), which replaces temporal average pooling with temporal attention pooling (TAP). TAP adaptively weights temporal features through three complementary mechanisms: time attention pooling (TA) for emphasizing salient features, velocity attention pooling (VA) for capturing transient changes, and conventional average pooling for robustness to stationary signals. Ablation studies show that TFD conv improves average PSDS1 by 3.02% over FDY conv with only a 14.8% increase in parameter count. Class-wise ANOVA and Tukey HSD analyses further show that TFD conv significantly enhances detection of transient-heavy events, outperforming existing FDY conv models. Notably, TFD conv achieves a maximum PSDS1 of 0.456, surpassing previous state-of-the-art SED systems. We also explore the compatibility of TAP with other FDY conv variants, including dilated FDY conv (DFD conv), partial FDY conv (PFD conv), and multi-dilated FDY conv (MDFD conv). Among these, integrating TAP with MDFD conv achieves the best result, a PSDS1 of 0.459, validating the complementary strengths of temporal attention and multi-scale frequency adaptation. These findings establish TFD conv as a powerful and generalizable framework for enhancing both transient sensitivity and overall feature robustness in SED.
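To make the three pooling mechanisms concrete, the following is a minimal sketch in plain Python, not the implementation described in the abstract. The abstract does not specify how TA and VA scores are computed or how the three pooled vectors are combined, so everything below is an assumption made for illustration: the function names, the linear scoring vectors `w_time` and `w_vel`, the use of absolute frame-to-frame differences as "velocity", and the equal-weight mean of the three branches.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pool(frames, weights):
    """Collapse T frames (each a feature vector) into one vector by a weighted sum over time."""
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


def temporal_attention_pool(frames, w_time, w_vel):
    """Hypothetical TAP sketch: mean of average, time-attention, and velocity-attention pooling.

    frames : list of T feature vectors (one per time frame)
    w_time : assumed scoring vector for time attention (stands in for learned weights)
    w_vel  : assumed scoring vector for velocity attention (stands in for learned weights)
    """
    n_frames = len(frames)
    dim = len(frames[0])

    # Branch 1: conventional average pooling, every frame weighted equally.
    avg = weighted_pool(frames, [1.0 / n_frames] * n_frames)

    # Branch 2: time attention, frames scored by a linear projection of their features.
    ta_scores = [sum(w * x for w, x in zip(w_time, f)) for f in frames]
    ta = weighted_pool(frames, softmax(ta_scores))

    # Branch 3: velocity attention, frames scored by the magnitude of the change
    # from the previous frame (the first frame is assigned zero velocity).
    velocity = [[0.0] * dim] + [
        [cur - prev for prev, cur in zip(frames[t - 1], frames[t])]
        for t in range(1, n_frames)
    ]
    va_scores = [sum(w * abs(v) for w, v in zip(w_vel, delta)) for delta in velocity]
    va = weighted_pool(frames, softmax(va_scores))

    # Assumed combination rule: equal-weight mean of the three pooled vectors.
    return [(a + b + c) / 3.0 for a, b, c in zip(avg, ta, va)]


if __name__ == "__main__":
    # A single-channel toy signal with one transient spike in the middle frame.
    toy = [[0.0], [0.0], [10.0], [0.0], [0.0]]
    print("average only:", weighted_pool(toy, [0.2] * 5))
    print("TAP sketch  :", temporal_attention_pool(toy, w_time=[1.0], w_vel=[1.0]))
```

On the toy signal, plain averaging dilutes the spike across all five frames, while the attention branches concentrate weight on the frames around the transient, so the pooled value is pulled toward the spike. With all-zero scoring vectors the softmax weights become uniform and the sketch reduces to average pooling.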
