Frequency dynamic convolution (FDY conv) has been a milestone in sound event detection (SED), but its multiple basis kernels substantially increase model size. In this work, we propose partial frequency dynamic convolution (PFD conv), which concatenates the output of a static, conventional 2D convolution branch with the output of a dynamic FDY conv branch in order to minimize the increase in model size while maintaining performance. Additionally, we propose multi-dilated frequency dynamic convolution (MDFD conv), which integrates multiple dilated frequency dynamic convolution (DFD conv) branches with different sets of dilation sizes, together with a static branch, within a single convolution module, achieving a 3.2% improvement in polyphonic sound detection score (PSDS) over FDY conv. Together with extensive ablation studies, the proposed methods further improve the understanding and usability of FDY conv variants.
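
To make the model-size argument concrete, the following is a rough parameter-count sketch, not the authors' implementation. It assumes that FDY conv stores `n_basis` full-size basis kernels, that PFD conv splits the output channels between a static branch and a dynamic branch before concatenating them along the channel axis, and that bias terms and the attention module that weights the basis kernels can be ignored. The function names, the default of 4 basis kernels, the 3x3 kernel size, and the channel counts are all hypothetical values chosen for illustration.

```python
def conv2d_params(c_in, c_out, k=3):
    """Weight count of a plain 2D convolution with a k x k kernel (bias ignored)."""
    return c_in * c_out * k * k


def fdy_conv_params(c_in, c_out, n_basis=4, k=3):
    """Weight count of the basis kernels of a frequency dynamic convolution.

    Assumption: n_basis full-size kernels are stored; the small attention
    module that mixes them per frequency bin is not counted here.
    """
    return n_basis * conv2d_params(c_in, c_out, k)


def pfd_conv_params(c_in, c_out, c_dynamic, n_basis=4, k=3):
    """Weight count of a partial frequency dynamic convolution.

    Assumption: c_dynamic output channels come from the dynamic branch and
    the remaining c_out - c_dynamic channels come from the static branch;
    the two outputs are concatenated to give c_out channels in total.
    """
    if not 0 <= c_dynamic <= c_out:
        raise ValueError("c_dynamic must lie between 0 and c_out")
    c_static = c_out - c_dynamic
    static_branch = conv2d_params(c_in, c_static, k)
    dynamic_branch = fdy_conv_params(c_in, c_dynamic, n_basis, k)
    return static_branch + dynamic_branch


if __name__ == "__main__":
    c_in, c_out = 64, 64  # hypothetical layer width
    print("static conv:", conv2d_params(c_in, c_out))
    print("FDY conv   :", fdy_conv_params(c_in, c_out))
    print("PFD conv   :", pfd_conv_params(c_in, c_out, c_dynamic=8))
```

Under these assumptions, the weight count grows linearly with the share of output channels assigned to the dynamic branch: setting `c_dynamic=0` recovers the static convolution and setting `c_dynamic=c_out` recovers full FDY conv.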