Sound Event Detection (SED) plays a vital role in audio understanding, with applications in surveillance, smart cities, healthcare, and multimedia indexing. However, conventional SED systems operate under a closed-world assumption, which limits their effectiveness in real-world environments where novel acoustic events frequently emerge. Inspired by the success of open-world learning in computer vision, we introduce the Open-World Sound Event Detection (OW-SED) paradigm, in which models must detect known events, identify unseen ones, and incrementally learn from them. To tackle the unique challenges of OW-SED, such as overlapping and ambiguous events, we propose a 1D deformable architecture that uses deformable attention to adaptively focus on salient temporal regions. Furthermore, we design a novel Open-World Deformable Sound Event Detection Transformer (WOOT) framework that incorporates feature disentanglement to separate class-specific and class-agnostic representations, together with a one-to-many matching strategy and a diversity loss to enhance representation diversity. Experimental results demonstrate that our method marginally outperforms leading existing techniques in closed-world settings and significantly improves over existing baselines in open-world scenarios.
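To make the idea of deformable attention along the time axis concrete, the following is a minimal single-head sketch in NumPy. It is not the WOOT implementation: the abstract does not specify the architecture's details, so the function names (`sample_linear`, `deformable_attention_1d`), the projection matrices `w_offset` and `w_weight`, the use of linear interpolation, and the choice to express offsets in frames are all assumptions made for illustration. Each query predicts a small number of temporal offsets around a reference point, samples the feature sequence at those fractional positions, and combines the samples with softmax weights.

```python
import numpy as np


def sample_linear(features, positions):
    """Linearly interpolate features of shape (T, D) at fractional frame positions."""
    T = features.shape[0]
    pos = np.clip(positions, 0.0, T - 1.0)   # keep samples inside the sequence
    lo = np.floor(pos).astype(int)           # left neighbouring frame
    hi = np.minimum(lo + 1, T - 1)           # right neighbouring frame
    frac = (pos - lo)[..., None]             # interpolation weight in [0, 1)
    return (1.0 - frac) * features[lo] + frac * features[hi]


def deformable_attention_1d(queries, features, ref_points, w_offset, w_weight):
    """Hypothetical single-head 1D deformable attention.

    queries:    (Q, D) query embeddings
    features:   (T, D) frame-level audio features
    ref_points: (Q,)   normalized reference positions in [0, 1]
    w_offset:   (D, K) projection predicting K sampling offsets (in frames)
    w_weight:   (D, K) projection predicting K attention logits
    returns:    (Q, D) attended features
    """
    T = features.shape[0]
    offsets = queries @ w_offset                          # (Q, K)
    logits = queries @ w_weight                           # (Q, K)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    positions = ref_points[:, None] * (T - 1) + offsets   # (Q, K) fractional frames
    sampled = sample_linear(features, positions)          # (Q, K, D)
    return (weights[..., None] * sampled).sum(axis=1)     # (Q, D)
```

Because each query attends to only K sampled positions rather than all T frames, the cost per query is independent of the sequence length, which is the usual motivation for deformable attention. In a full model the two projections would be learned and the scheme would typically be extended to multiple heads and feature scales.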