Multi-modal large language models (MLLMs) have achieved remarkable success on complex multi-modal tasks. However, it remains insufficiently explored whether they exhibit $\textbf{modality preference}$, a tendency to favor one modality over another when processing multi-modal contexts. To study this question, we introduce the $\textbf{MC\textsuperscript{2}}$ benchmark, which constructs controlled evidence-conflict scenarios to systematically evaluate modality preference in decision-making. Extensive experiments reveal that all 20 tested MLLMs demonstrate clear modality preferences, and that these preferences serve as a useful indicator of their downstream task performance. Further analysis shows that modality preference can be controlled through instruction guidance and is captured within the latent representations of MLLMs. Building on these insights, we propose a probing and steering method based on representation engineering that explicitly controls modality preference without requiring additional fine-tuning. The method effectively amplifies modality preference in a desired direction and demonstrates promising improvements across multiple multi-modal understanding and reasoning tasks.
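To make the probing-and-steering idea concrete, the sketch below shows one common representation-engineering recipe: fit a difference-of-means probe on hidden states collected from text-preferring versus image-preferring responses, then inject the resulting direction at one layer via a forward hook. This is a minimal illustration under stated assumptions, not the paper's released implementation; the layer choice, hook target, and scaling coefficient `alpha` are all hypothetical.

```python
# Minimal sketch of probing and steering modality preference via
# representation engineering. Assumes a HuggingFace-style decoder whose
# transformer layers expose their hidden states through forward hooks.
import torch

def fit_preference_direction(text_pref_acts, image_pref_acts):
    """Probe: difference-of-means direction separating hidden states of
    text-preferring vs. image-preferring responses.

    Both inputs are [n, d] tensors of hidden states collected at one layer.
    """
    direction = text_pref_acts.mean(dim=0) - image_pref_acts.mean(dim=0)
    return direction / direction.norm()

def add_steering_hook(layer, direction, alpha=4.0):
    """Steer: shift every hidden state at `layer` along the probed
    direction; here a positive alpha pushes toward the text modality."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden)  # match dtype/device
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)
```

A typical use would register the hook on a mid-depth layer (e.g., `model.model.layers[k]` in many HuggingFace MLLMs, a hypothetical path here), generate with steering active, and call `.remove()` on the returned handle to restore default behavior.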