Gradient-based attribution methods can highlight the input regions that matter most for a neural network's prediction, but their effectiveness for temporal sound event detection in audio classification has not been systematically evaluated. This paper assesses whether integrated gradients (IG) can localize sound events in time when applied to a classifier trained without temporal supervision. We use synthetic polyphonic audio with ground-truth timestamps to measure the alignment between IG attributions and event boundaries. On a 10-class domestic sound dataset, IG achieves a mean Intersection over Union (IoU) of 0.39, a frame-level F1 of 0.52, and a Pointing Game (PG) accuracy of 82.6\%. For comparison, a framewise CNN trained with weak supervision (FW-WS, clip-level training labels) achieves 0.42 IoU, 0.55 F1, and 97.3\% PG, while a strongly supervised variant (FW-SS, frame-level training labels) reaches 0.45 IoU, 0.58 F1, and 97.9\% PG. Overall, these results suggest that post-hoc IG captures meaningful temporal activity patterns of sound events, with localization performance approaching that of models that explicitly produce frame-level predictions. All methods substantially outperform random and energy-based baselines.
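To make the three reported metrics concrete, the following is a minimal sketch of how temporal IoU, frame-level F1, and a Pointing Game hit could be computed for a single event from a per-frame attribution curve and a ground-truth activity mask. The function names, the min-max normalization, and the 0.5 threshold are illustrative assumptions; the abstract does not specify how attributions are aggregated or binarized, so this should not be read as the paper's exact protocol.

```python
import numpy as np


def binarize(scores, threshold=0.5):
    """Min-max normalize a per-frame score curve, then threshold it.

    The normalization and the default threshold are assumptions made
    for illustration only.
    """
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        # A flat curve carries no temporal information.
        return np.zeros(scores.shape, dtype=bool)
    return (scores - scores.min()) / span >= threshold


def temporal_iou(pred, truth):
    """Intersection over Union of two boolean frame masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, truth).sum() / union)


def frame_f1(pred, truth):
    """Frame-level F1 between predicted and true activity masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def pointing_game_hit(scores, truth):
    """True if the highest-scoring frame lies inside the true event."""
    truth = np.asarray(truth, dtype=bool)
    return bool(truth[int(np.argmax(scores))])


# Toy example: 8 frames, event active on frames 3 to 5 (hypothetical data).
scores = np.array([0.0, 0.1, 0.9, 1.0, 0.8, 0.1, 0.0, 0.0])
truth = np.array([0, 0, 0, 1, 1, 1, 0, 0], dtype=bool)
pred = binarize(scores)

iou = temporal_iou(pred, truth)
f1 = frame_f1(pred, truth)
hit = pointing_game_hit(scores, truth)
```

In this toy case the thresholded attribution covers frames 2 to 4, which overlaps the true event on two of four frames in the union. Dataset-level figures such as those quoted above would then be averages of these per-event values, under whatever aggregation the full paper defines.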