Learning meaningful frame-wise features from a partially labeled dataset is crucial to semi-supervised sound event detection. Prior works either enforce consistency on frame-level predictions or seek feature-level similarity among neighboring frames, and neither approach fully exploits the potential of unlabeled data. In this work, we design a Local and Global Consistency (LGC) regularization scheme that enhances the model at both the label and feature levels. Audio CutMix is introduced to change the contextual information of clips. Local consistency is then adopted to encourage the model to leverage local features for frame-level predictions, and global consistency is applied to force features to align with global prototypes through a specially designed contrastive loss. Experiments on the DESED dataset indicate the superiority of LGC, which outperforms comparable methods by a large margin under the same settings as the baseline system. Moreover, combining LGC with existing methods yields further improvements. The code will be released soon.
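As an illustration of how a time-domain CutMix can change the context of a clip while keeping frame-level targets aligned, a minimal sketch follows. It is not the released implementation: the function name `audio_cutmix`, the `max_ratio` parameter, and the choice to paste a single contiguous segment at the same time position are all assumptions made for this example. The targets could be ground-truth frame labels or a teacher model's frame-level predictions; the abstract does not specify which.

```python
import random


def audio_cutmix(clip_a, targets_a, clip_b, targets_b, rng, max_ratio=0.5):
    """Paste one random time segment of clip_b (and its targets) into clip_a.

    Each clip is a list of frames (one feature vector per frame) and each
    targets list holds one frame-level target per frame. All names and the
    max_ratio default are hypothetical choices for this sketch.
    """
    n_frames = len(clip_a)
    if n_frames == 0 or not (
        len(targets_a) == len(clip_b) == len(targets_b) == n_frames
    ):
        raise ValueError("clips and targets must share a non-zero frame count")

    # Segment length is at least one frame and at most max_ratio of the clip.
    max_len = max(1, int(max_ratio * n_frames))
    seg_len = rng.randint(1, max_len)            # randint is inclusive
    start = rng.randint(0, n_frames - seg_len)
    stop = start + seg_len

    # Frames inside [start, stop) come from clip_b, the rest from clip_a.
    # The targets are mixed with the same boundaries so they stay aligned.
    mixed_clip = clip_a[:start] + clip_b[start:stop] + clip_a[stop:]
    mixed_targets = targets_a[:start] + targets_b[start:stop] + targets_a[stop:]
    return mixed_clip, mixed_targets, (start, stop)


# Toy usage: 10 frames, 2 feature bins, 1 event class.
rng = random.Random(0)
clip_a = [[0.0, 0.0] for _ in range(10)]
clip_b = [[1.0, 1.0] for _ in range(10)]
targets_a = [[0] for _ in range(10)]
targets_b = [[1] for _ in range(10)]
mixed_clip, mixed_targets, (start, stop) = audio_cutmix(
    clip_a, targets_a, clip_b, targets_b, rng
)
```

Because the pasted frames keep their own targets, a frame-level prediction on the mixed clip can be compared against the mixed targets regardless of the new surrounding context, which is the intuition behind encouraging the model to rely on local features.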