Existing work on weakly supervised sound event detection (WSSED) has not addressed two types of co-occurrence simultaneously: some sound events often occur together, and events are usually accompanied by specific background sounds. With only clip-level supervision, co-occurring events and backgrounds inevitably become entangled, causing misclassification and biased localization. To tackle this issue, we first establish a structural causal model (SCM), which reveals that context is the main source of the co-occurrence confounders that mislead the model into learning spurious correlations between frames and clip-level labels. Based on this causal analysis, we propose a causal intervention (CI) method for WSSED that removes the negative impact of co-occurrence confounders by iteratively accumulating every possible context of each class and then re-projecting the contexts onto the frame-level features to make event boundaries clearer. Experiments show that our method improves performance on multiple datasets and generalizes to various baseline models.
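The abstract describes the intervention only at a high level: contexts are accumulated per class and then re-projected onto the frame-level features. The following is a minimal sketch of one way such a loop could look, not the authors' implementation. The function names `update_context_bank` and `reproject`, the momentum-based running average, the probability-weighted pooling, and the additive re-projection are all assumptions made for illustration.

```python
import numpy as np


def update_context_bank(bank, frame_feats, frame_probs, clip_labels, momentum=0.9):
    """Accumulate one clip's context into a per-class context bank (hypothetical helper).

    bank:        (C, D) running context vector for each class
    frame_feats: (T, D) frame-level features of the clip
    frame_probs: (T, C) frame-level class probabilities
    clip_labels: (C,)   clip-level (weak) multi-hot labels
    """
    bank = bank.copy()
    for c in np.flatnonzero(clip_labels):
        # Pool the frames of this clip, weighted by how strongly each frame
        # responds to class c (an assumed pooling choice).
        weights = frame_probs[:, c]
        weights = weights / (weights.sum() + 1e-8)
        clip_context = weights @ frame_feats  # (D,)
        # Assumed accumulation rule: exponential moving average over clips.
        bank[c] = momentum * bank[c] + (1.0 - momentum) * clip_context
    return bank


def reproject(frame_feats, frame_probs, bank):
    """Re-project accumulated contexts onto frame features (hypothetical helper).

    Each frame receives a mixture of class contexts weighted by its own
    class probabilities; the mixture is added to the original feature.
    """
    context = frame_probs @ bank  # (T, D)
    return frame_feats + context


# Toy usage: 4 frames, 3-dim features, 2 classes, only class 0 present.
bank = np.zeros((2, 3))
feats = np.ones((4, 3))
probs = np.tile(np.array([1.0, 0.0]), (4, 1))
labels = np.array([1, 0])

bank = update_context_bank(bank, feats, probs, labels, momentum=0.5)
enhanced = reproject(feats, probs, bank)
```

In this sketch, classes absent from the clip-level label leave the bank untouched, so each class's context is built only from clips that are weakly labeled with it. Repeating the update over many clips is what "iteratively accumulating" is taken to mean here.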