Multimodal large language models (MLLMs) enable interaction over both text and images, but their safety behavior can be driven by unimodal shortcuts rather than genuine joint understanding of intent. We introduce CSR-Bench, a benchmark for evaluating cross-modal reliability through four stress-testing interaction patterns spanning Safety, Over-rejection, Bias, and Hallucination, covering 61 fine-grained types. Each instance is constructed to require integrated image-text interpretation, and we additionally provide paired text-only controls to diagnose modality-induced behavior shifts. We evaluate 16 state-of-the-art MLLMs and observe systematic cross-modal alignment gaps: models show weak safety awareness, strong language dominance under interference, and consistent performance degradation from text-only controls to multimodal inputs. We also observe a clear trade-off between reducing over-rejection and maintaining safe, non-discriminatory behavior, suggesting that some apparent safety gains may come from refusal-oriented heuristics rather than robust intent understanding. WARNING: This paper contains unsafe content.