Speech deepfake detection (SDD) focuses on identifying whether a given speech signal is genuine or synthetically generated. Existing audio large language model (LLM)-based methods excel at content understanding; however, their predictions are often biased toward semantically correlated cues, so fine-grained acoustic artifacts are overlooked during the decision-making process. Consequently, fake speech with natural semantics can bypass detectors despite harboring subtle acoustic anomalies; this suggests that the challenge stems not from the absence of acoustic information, but from its limited accessibility when semantics-dominated reasoning prevails. To address this issue, we investigate SDD within the audio LLM paradigm and introduce SDD with Auditory Perception-enhanced Audio Large Language Model (SDD-APALLM), an acoustically enhanced framework designed to explicitly expose fine-grained time-frequency evidence as accessible acoustic cues. By combining raw audio with structured spectrograms, the proposed framework enables audio LLMs to capture subtle acoustic inconsistencies more effectively without compromising their semantic understanding. Experimental results show consistent gains in detection accuracy and robustness, especially in cases where semantic cues are misleading. Further analysis reveals that these improvements stem from the coordinated use of semantic and acoustic information rather than simple modality aggregation.
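To make the dual-view input concrete, the sketch below shows one plausible way to pair a raw waveform with a structured log-mel spectrogram, as the abstract describes. This is not the authors' implementation; the function name, parameter values, and the choice of torchaudio are all illustrative assumptions.

```python
# A minimal sketch (not the SDD-APALLM code) of preparing the two input views
# the abstract describes: the raw waveform plus a structured log-mel
# spectrogram that explicitly exposes fine-grained time-frequency evidence.
import torch
import torchaudio


def build_dual_view(wav_path: str, sample_rate: int = 16000):
    """Return (waveform, log_mel) for one utterance. All hyperparameters
    here are assumptions for illustration, not values from the paper."""
    waveform, sr = torchaudio.load(wav_path)               # (channels, samples)
    if sr != sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, sample_rate)
    waveform = waveform.mean(dim=0, keepdim=True)          # downmix to mono

    # Structured spectrogram view: log-mel features surface subtle
    # time-frequency artifacts that a semantics-oriented encoder may skip.
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=1024,
        hop_length=160,    # 10 ms hop at 16 kHz
        n_mels=128,
    )(waveform)
    log_mel = torch.log(mel + 1e-6)                        # (1, n_mels, frames)

    # Both views would then be passed to the audio LLM: the waveform through
    # its usual audio encoder, the spectrogram as an explicit acoustic cue.
    return waveform, log_mel
```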