Recent sound event localization and detection (SELD) research has focused predominantly on long segments (typically 5 to 10 seconds, occasionally 2 seconds). This improves benchmark performance but lacks the temporal granularity that real-world applications need. To bridge this gap, this paper investigates SELD with distance estimation (3D SELD) on short segments, specifically a 1-second window, and establishes a new baseline for practical 3D SELD. We further explore the impact of three filter banks (Bark, Mel, and Gammatone) on audio feature extraction; experimental results show that the Gammatone filter bank achieves the highest overall accuracy in this setting. Finally, we propose replacing the convolutional modules within the CST-Former, a competitive SELD architecture, with the SCConv module. This change yields measurable F-score gains in short-segment scenarios, underscoring SCConv's potential to improve spatial and channel feature representation. The experimental results highlight our approach as a significant step towards the real-world deployment of 3D SELD systems under low-latency constraints.
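To make the segment-length distinction concrete, the following is a minimal sketch of how a multichannel recording might be cut into fixed-length segments before feature extraction. It is not taken from the paper: the function name `split_into_segments`, the 24 kHz sample rate, the four-channel input, and the choice of non-overlapping segments with the trailing remainder dropped are all assumptions made for illustration.

```python
import numpy as np


def split_into_segments(audio, sample_rate, segment_seconds=1.0):
    """Split (channels, samples) audio into non-overlapping fixed-length segments.

    Hypothetical helper: trailing samples that do not fill a whole
    segment are dropped. Returns shape (n_segments, channels, segment_samples).
    """
    seg_len = int(round(sample_rate * segment_seconds))
    n_channels, n_samples = audio.shape
    n_segments = n_samples // seg_len
    trimmed = audio[:, : n_segments * seg_len]
    # (channels, n_segments, seg_len) -> (n_segments, channels, seg_len)
    return trimmed.reshape(n_channels, n_segments, seg_len).transpose(1, 0, 2)


sample_rate = 24000  # assumed sample rate, not stated in the abstract
rng = np.random.default_rng(0)
# Synthetic 5.5-second, 4-channel recording standing in for real audio.
recording = rng.standard_normal((4, int(5.5 * sample_rate)))

short_segments = split_into_segments(recording, sample_rate, segment_seconds=1.0)
long_segments = split_into_segments(recording, sample_rate, segment_seconds=5.0)
print(short_segments.shape, long_segments.shape)
```

Under these assumptions, the same recording yields several 1-second segments but only one 5-second segment, so a system operating on 1-second inputs can emit predictions after far less buffered audio.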