Although many models exist to detect singing voice deepfakes (SingFake), it remains unclear how these models operate, particularly in the presence of instrumental accompaniment. We investigate how instrumental music affects SingFake detection from two perspectives. To examine the behavioral effect, we test different backbones, unpaired instrumental tracks, and frequency subbands. To analyze the representational effect, we probe how fine-tuning alters the encoders' speech and music capabilities. Our results show that instrumental accompaniment acts mainly as data augmentation rather than providing intrinsic cues (e.g., rhythm or harmony). Furthermore, fine-tuning increases reliance on shallow speaker features while reducing sensitivity to content, paralinguistic, and semantic information. These insights clarify how models exploit vocal versus instrumental cues and can inform the design of more interpretable and robust SingFake detection systems.