Speech deepfake detection is predominantly treated as an opaque classification task in which all temporal frames are aggregated equally. This ignores the fact that different phonetic categories carry vastly different amounts of discriminative information. To address this, we propose a phoneme-guided cross-attention framework that transforms detection into an interpretable, phonetically grounded process. We factorize the spoofing posterior $P(\text{spoofed}\mid X, W)$, conditioned on the acoustic representation $X$ and the phonetic posteriorgram $W$. The resulting factorization can be written as $P(\text{spoofed} \mid X, W) = \sum_{i=1}^{M} w_i \cdot P(\text{spoofed} \mid X, Z = z_i)$, where $M$ denotes the number of phonetic classes, $P(\text{spoofed} \mid X, Z = z_i)$ is the spoofing probability for the $i$-th phonetic class $z_i$ conditioned on $X$, and each $w_i$ is the prevalence of phonetic class $z_i$ in the utterance. Our transformer-based architecture instantiates this factorization through a cross-attention block in which phonetic queries selectively probe information in acoustic keys and values, with softmax-normalized pooling supplying explicit phone-presence weights. Unlike prior approaches that rely heavily on post-hoc explainability methods, our framework offers phonetic explainability by design. We evaluate the framework on an LJSpeech-derived corpus, ASVspoof 2019 LA, and ASVspoof 5 Track 1. Per-phone importance rankings reveal that discriminative power concentrates on articulatory categories that generative models struggle to reproduce faithfully: stops, fricatives, affricates, nasals, and silence-boundary closures rank as most discriminative, while periodic vowels and semivowels rank lower. Beyond competitive performance, our model provides structural interpretability, yielding an inspectable per-articulatory-category breakdown of the final verdict.
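The factorization can be read as a weighted mixture over phonetic classes, with one attention-pooled acoustic summary per class. The NumPy sketch below illustrates one way this reading might be realized; it is not the authors' implementation. The function names (`cross_attend`, `spoof_posterior`), the sigmoid scoring head, the random inputs, and all dimensions are assumptions introduced purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys, values):
    """Single-head cross-attention.

    queries: (M, d) one query per phonetic class (assumed layout)
    keys, values: (T, d) one row per acoustic frame
    Returns an (M, d) array: one attention-pooled summary per class.
    """
    d = queries.shape[-1]
    attn = softmax(queries @ keys.T / np.sqrt(d), axis=-1)  # (M, T), rows sum to 1
    return attn @ values

def spoof_posterior(phone_scores, per_phone_prob):
    """Mixture sum_i w_i * P(spoofed | X, Z = z_i).

    phone_scores: (M,) pooled phone-presence scores (hypothetical input);
    the softmax turns them into weights w_i that sum to 1.
    Returns the utterance-level probability and the per-class contributions.
    """
    w = softmax(phone_scores)
    contributions = w * per_phone_prob
    return float(contributions.sum()), contributions

# Toy run with random stand-ins for learned quantities (illustrative only).
rng = np.random.default_rng(0)
M, T, d = 4, 10, 8
summaries = cross_attend(rng.normal(size=(M, d)),
                         rng.normal(size=(T, d)),
                         rng.normal(size=(T, d)))
head = rng.normal(size=d)  # hypothetical linear scoring head
per_phone_prob = 1.0 / (1.0 + np.exp(-(summaries @ head)))
p, contributions = spoof_posterior(np.zeros(M), per_phone_prob)
```

Because the weights sum to one, the entries of `contributions` add up exactly to the utterance-level probability `p`, which is what makes a per-class breakdown of the verdict directly inspectable in this sketch.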