Moral reasoning is fundamental to safe Artificial Intelligence (AI), yet ensuring its consistency across modalities becomes critical as AI systems evolve from text-based assistants to embodied agents. Current safety techniques succeed in textual contexts, but their generalization to visual inputs remains uncertain. Existing moral evaluation benchmarks rely on text-only formats and lack systematic control over the variables that influence moral decision-making. Here we show that visual inputs fundamentally alter moral decision-making in state-of-the-art (SOTA) Vision-Language Models (VLMs), bypassing text-based safety mechanisms. We introduce Moral Dilemma Simulation (MDS), a multimodal benchmark grounded in Moral Foundations Theory (MFT) that enables mechanistic analysis through orthogonal manipulation of visual and contextual variables. Our evaluation reveals that the vision modality activates intuition-like pathways that override the more deliberate, safer reasoning patterns observed in text-only contexts. These findings expose a critical vulnerability: language-tuned safety filters fail to constrain visual processing, underscoring the urgent need for multimodal safety alignment.
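To make "orthogonal manipulation of visual and contextual variables" concrete, the sketch below fully crosses the benchmark's factors so each variable's effect on a model's moral judgment can be isolated. This is a hypothetical illustration of a factorial stimulus design, not the authors' released code; all names here (FOUNDATIONS, MODALITIES, CONTEXTS, Stimulus, build_factorial_grid) are assumptions introduced for exposition.

```python
# Hypothetical sketch of an orthogonal (fully crossed) stimulus design for a
# multimodal moral-dilemma benchmark. Factor names are illustrative, not MDS's
# actual schema.
from dataclasses import dataclass
from itertools import product

# The five classic Moral Foundations Theory (MFT) dimensions.
FOUNDATIONS = ["care", "fairness", "loyalty", "authority", "sanctity"]
# Presentation modality for the same underlying dilemma.
MODALITIES = ["text_only", "image_plus_text"]
# Example contextual variable crossed orthogonally with the others.
CONTEXTS = ["bystander_present", "bystander_absent"]

@dataclass(frozen=True)
class Stimulus:
    foundation: str
    modality: str
    context: str

def build_factorial_grid() -> list[Stimulus]:
    """Cross every factor with every other, so that comparing responses
    across matched conditions attributes differences to a single variable
    (e.g., modality) while holding the rest fixed."""
    return [Stimulus(f, m, c)
            for f, m, c in product(FOUNDATIONS, MODALITIES, CONTEXTS)]

if __name__ == "__main__":
    grid = build_factorial_grid()
    print(f"{len(grid)} conditions")  # 5 foundations x 2 modalities x 2 contexts = 20
```

Because the factors are independent, the text-only and image-plus-text versions of each dilemma differ only in modality, which is what lets the benchmark attribute shifts in moral judgment to visual input rather than to confounded changes in scenario content.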