Multimodal large language models (MLLMs) are increasingly adopted in forensics for their robust semantic understanding. However, as AI-generated images become more realistic, semantic-level inconsistencies alone are often insufficient for reliable detection. This motivates a critical question: can MLLMs achieve full-spectrum forensic signal perception, i.e., capture low-level generator artifacts without sacrificing pre-trained semantic knowledge? To answer it, we perform a layer-wise analysis of forensic signal perception in MLLMs, showing that semantic information is primarily formed in the early-to-middle layers and that direct fine-tuning for artifact learning disrupts these semantic representations. Based on this insight, we propose Deep Visual Residual MLLM (Deep-VRM), which preserves early semantic processing while injecting artifact-specific visual signals through a residual path into an intermediate layer, where they are fused with semantic token representations and propagated through the subsequent trainable layers. This enables the later layers to jointly model semantic reasoning and signal-level forensic cues. Surprisingly, the model learns to adaptively leverage different levels of forensic signals depending on the input, achieving robust and generalizable detection performance. Extensive experiments show that our method achieves state-of-the-art performance on most benchmarks. The code and data are available at https://github.com/KQL11/Deep-VRM.
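The injection scheme described above can be illustrated with a toy sketch. This is not the Deep-VRM implementation: the names `Layer`, `build_stack`, `inject_residual`, `forward`, the `gate` parameter, and the index `inject_at` are all hypothetical. The sketch also assumes that fusion is a plain element-wise addition at the visual-token positions and that "preserving early semantic processing" means freezing the layers before the injection point. Each transformer block is replaced by a trivial stand-in function so the control flow stays visible.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]
Tokens = List[Vector]  # one hidden vector per token


@dataclass
class Layer:
    """Toy stand-in for a transformer block (hypothetical, not a real MLLM layer)."""
    fn: Callable[[Tokens], Tokens]
    trainable: bool


def scale_layer(factor: float, trainable: bool) -> Layer:
    # Placeholder computation: scale every hidden value by `factor`.
    return Layer(fn=lambda toks: [[factor * x for x in t] for t in toks],
                 trainable=trainable)


def build_stack(num_layers: int, inject_at: int) -> List[Layer]:
    # Assumption: layers before the injection point are frozen,
    # layers from the injection point onward are trainable.
    return [scale_layer(1.0, trainable=(i >= inject_at)) for i in range(num_layers)]


def inject_residual(hidden: Tokens, artifact: Tokens,
                    visual_positions: Sequence[int], gate: float = 1.0) -> Tokens:
    """Add artifact features to the hidden states at visual-token positions only."""
    if len(artifact) != len(visual_positions):
        raise ValueError("need one artifact vector per visual token")
    out = [list(t) for t in hidden]  # copy, so the input is left untouched
    for feat, pos in zip(artifact, visual_positions):
        if len(feat) != len(out[pos]):
            raise ValueError("artifact and hidden dimensions must match")
        # Assumed fusion rule: element-wise residual addition, scaled by `gate`.
        out[pos] = [h + gate * a for h, a in zip(out[pos], feat)]
    return out


def forward(layers: List[Layer], hidden: Tokens, artifact: Tokens,
            visual_positions: Sequence[int], inject_at: int) -> Tokens:
    for i, layer in enumerate(layers):
        if i == inject_at:
            # Fuse artifact signals just before the first trainable layer.
            hidden = inject_residual(hidden, artifact, visual_positions)
        hidden = layer.fn(hidden)
    return hidden
```

In this sketch, text tokens pass through unchanged by the injection, while visual tokens carry both the semantic state produced by the frozen early layers and the added artifact features into the later layers. A real model would typically project the artifact features to the hidden dimension before adding them; that projection is omitted here.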