Partially manipulating a sentence can greatly change its meaning. Recent work shows that countermeasures (CMs) trained on partially spoofed audio can effectively detect such spoofing. However, how these CMs reach their decisions remains poorly understood. We use Grad-CAM and introduce a quantitative analysis metric to interpret CMs' decisions. We find that CMs trained on partially spoofed audio prioritize the artifacts in transition regions created when bona fide and spoofed audio are concatenated. This focus differs from that of CMs trained on fully spoofed audio, which concentrate on the pattern differences between bona fide and spoofed parts. Further investigation explains how the CMs' focus varies between correct and incorrect predictions. These insights provide a basis for the design of CM models and the creation of datasets. Moreover, this work lays a foundation for interpretability in partially spoofed audio detection, an area that has not been well explored.