Despite substantial efforts toward improving the moral alignment of Vision-Language Models (VLMs), it remains unclear whether their ethical judgments are stable in realistic settings. This work studies moral robustness in VLMs, defined as the ability to preserve moral judgments under textual and visual perturbations that do not alter the underlying moral context. We systematically probe VLMs with a diverse set of model-agnostic multimodal perturbations and find that their moral stances are highly fragile, frequently flipping under simple manipulations. Our analysis reveals systematic vulnerabilities across perturbation types, moral domains, and model scales, including a sycophancy trade-off where stronger instruction-following models are more susceptible to persuasion. We further show that lightweight inference-time interventions can partially restore moral stability. These results demonstrate that moral alignment alone is insufficient and that moral robustness is a necessary criterion for the responsible deployment of VLMs.