In this work, we introduce a multi-task transformer for speech deepfake detection, capable of predicting formant trajectories and voicing patterns over time, ultimately classifying speech as real or fake, and highlighting whether its decisions rely more on voiced or unvoiced regions. Building on a prior speaker-formant transformer architecture, we streamline the model with an improved input segmentation strategy, redesign the decoding process, and integrate built-in explainability. Compared to the baseline, our model requires fewer parameters, trains faster, and provides better interpretability, without sacrificing prediction performance.