Visualization is of great value in understanding the internal mechanisms of neural networks. Previous work found that LayerCAM is a reliable visualization tool for deep speaker models. In this paper, we use LayerCAM to analyze the widely adopted data augmentation (DA) approach and to understand how it leads to model robustness. We conduct speaker identification experiments on the VoxCeleb1 dataset, which show that both vanilla and activation-based (Act) DA approaches enhance robustness against interference, with Act DA being consistently superior. Visualization with LayerCAM suggests that DA helps models learn to delete temporal-frequency (TF) bins that are corrupted by interference. This "learn to delete" behavior explains why DA models are more robust than clean models, and why Act DA is superior to vanilla DA when the interference is nontarget speech. However, LayerCAM still cannot clearly explain the superiority of Act DA in other situations, suggesting that further research is needed.