Overlapping sound events are ubiquitous in real-world environments, but existing end-to-end sound event detection (SED) methods still struggle to detect them. A key reason is that these methods represent overlapping events with shared, entangled frame-wise features, which reduces feature discriminability. To address this problem, we propose a disentangled feature learning framework that learns category-specific representations. Specifically, we employ a separate projector for each category to learn its frame-wise features. To ensure that these features do not contain information about other categories, we propose a frame-wise contrastive loss that maximizes the common information between frame-wise features of the same category. In addition, because the labeled data available to the proposed method is limited, we propose a semi-supervised frame-wise contrastive loss that leverages large amounts of unlabeled data to achieve feature disentanglement. Experimental results demonstrate the effectiveness of our method.
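To make the idea of category-specific projectors and a frame-wise contrastive objective concrete, the following is a minimal sketch in plain Python. It is not the loss defined in this work: it assumes a linear projector per category and a supervised-contrastive-style formulation in which frames where a category is active are pulled together and all other frames serve as negatives. The names `project`, `frame_contrastive_loss`, `temperature`, and the toy data are hypothetical and chosen only for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def project(frames, weight):
    """Apply one category's linear projector (assumed form) to every frame.

    frames: list of shared frame-wise feature vectors.
    weight: list of rows; each row produces one output dimension.
    """
    return [l2_normalize([sum(w * x for w, x in zip(row, f)) for row in weight])
            for f in frames]

def frame_contrastive_loss(embeddings, active, temperature=0.1):
    """Supervised-contrastive-style loss over the frames of one category.

    embeddings: L2-normalized frame embeddings from that category's projector.
    active: 0/1 per frame, 1 if the category is present in that frame.
    Returns 0.0 when fewer than two frames are active (no positive pairs).
    """
    n = len(embeddings)
    if n < 2:
        return 0.0
    total, pairs = 0.0, 0
    for i in range(n):
        if not active[i]:
            continue
        # Temperature-scaled cosine similarity of anchor i to every frame.
        sims = [sum(a * b for a, b in zip(embeddings[i], embeddings[j])) / temperature
                for j in range(n)]
        # Numerically stable log-sum-exp over all frames except the anchor.
        others = [sims[j] for j in range(n) if j != i]
        m = max(others)
        log_denom = m + math.log(sum(math.exp(s - m) for s in others))
        for p in range(n):
            if p != i and active[p]:
                total += log_denom - sims[p]  # -log softmax of the positive
                pairs += 1
    return total / pairs if pairs else 0.0

# Toy example: 4 frames of shared 2-D features, two categories.
frames = [[1.0, 0.2], [0.9, 0.1], [0.1, 1.0], [0.2, 0.8]]
projectors = {"speech": [[1.0, 0.0], [0.0, 1.0]],
              "alarm": [[0.0, 1.0], [1.0, 0.0]]}
frame_labels = {"speech": [1, 1, 0, 0], "alarm": [0, 0, 1, 1]}

# One loss term per category, each computed on that category's own projection.
losses = {c: frame_contrastive_loss(project(frames, projectors[c]), frame_labels[c])
          for c in projectors}
```

Under these assumptions each category receives its own embedding space, so a frame containing two overlapping events can be close to the active frames of both categories at once, which a single shared embedding cannot express. A semi-supervised variant would need some source of frame-level targets for unlabeled clips; how those are obtained is not specified here and is left out of the sketch.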