Omni-modal Large Language Models (OLLMs) greatly expand the multimodal capabilities of LLMs but also introduce cross-modal safety risks. However, a systematic understanding of vulnerabilities in omni-modal interactions remains lacking. To bridge this gap, we establish a modality-semantics decoupling principle and construct the AdvBench-Omni dataset, which reveals a significant vulnerability in OLLMs. Mechanistic analysis uncovers a Mid-layer Dissolution phenomenon driven by refusal vector magnitude shrinkage, and identifies a modal-invariant pure refusal direction. Inspired by these insights, we extract a golden refusal vector using Singular Value Decomposition and propose OmniSteer, which uses lightweight adapters to adaptively modulate intervention intensity. Extensive experiments show that our method not only increases the Refusal Success Rate against harmful inputs from 69.9% to 91.2%, but also effectively preserves general capabilities across all modalities. Our code is available at: https://github.com/zhrli324/omni-safety-research.
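The abstract's extraction step can be illustrated with a minimal sketch. Assuming one difference-of-means activation vector per modality (harmful minus harmless prompts), stacking these vectors and taking the top singular direction yields a shared, modal-invariant component; the function name `extract_refusal_direction` and the synthetic data below are hypothetical and not from the paper:

```python
import numpy as np

def extract_refusal_direction(diff_vectors):
    """Hypothetical sketch: stack per-modality difference-of-means
    vectors and take the dominant right singular vector of the stack
    as a shared (modal-invariant) refusal direction."""
    M = np.stack(diff_vectors)               # shape (n_modalities, d_model)
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    direction = Vt[0]                        # top singular direction, unit norm
    # fix the sign so it points the same way as the mean difference
    if np.mean(M @ direction) < 0:
        direction = -direction
    return direction

# toy demo: three "modalities" sharing one underlying direction plus noise
rng = np.random.default_rng(0)
base = rng.normal(size=64)
base /= np.linalg.norm(base)
diffs = [3.0 * base + 0.1 * rng.normal(size=64) for _ in range(3)]
v = extract_refusal_direction(diffs)
```

In this toy setting, `v` recovers `base` up to a small noise-induced error (`|v @ base|` is close to 1), mirroring the idea that SVD isolates the component shared across modalities while discarding modality-specific noise.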