Multimodal large language models often struggle with faithful reasoning in complex visual scenes, where intricate entities and relations require precise visual grounding at each step. This reasoning unfaithfulness frequently manifests as hallucinated entities, mis-grounded relations, skipped steps, and over-specified reasoning. Existing preference-based approaches, typically relying on textual perturbations or answer-conditioned rationales, fail to address this challenge as they allow models to exploit language priors to bypass visual grounding. To address this, we propose SceneAlign, a framework that leverages scene graphs as structured visual information to perform controllable structural interventions. By identifying reasoning-critical nodes and perturbing them through four targeted strategies that mimic typical grounding failures, SceneAlign constructs hard negative rationales that remain linguistically plausible but are grounded in inaccurate visual facts. These contrastive pairs are used in Direct Preference Optimization to steer models toward fine-grained, structure-faithful reasoning. Across seven visual reasoning benchmarks, SceneAlign consistently improves answer accuracy and reasoning faithfulness, highlighting the effectiveness of grounding-aware alignment for multimodal reasoning.
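To make the pipeline concrete, the following is a minimal Python sketch, using a toy scene graph and entirely hypothetical names, of how a hard-negative rationale might be built by perturbing a reasoning-critical node and then paired with the faithful rationale as a DPO preference example. The four strategy labels below are placeholders that loosely mirror the failure modes named above (hallucinated entities, mis-grounded relations, over-specified reasoning, skipped steps); they are not the paper's exact definitions.

```python
# Hypothetical sketch of SceneAlign-style preference-pair construction.
# A toy scene graph: nodes are entities with attributes; edges are relations.
scene_graph = {
    "nodes": {
        "n1": {"label": "man", "attributes": ["standing"]},
        "n2": {"label": "dog", "attributes": ["brown"]},
        "n3": {"label": "frisbee", "attributes": ["red"]},
    },
    "edges": [("n1", "throws", "n3"), ("n2", "chases", "n3")],
}

def perturb_node(graph, node_id, strategy):
    """Apply one of four illustrative perturbation strategies to a node.

    The strategy names are placeholders for grounding-failure patterns,
    not the authors' exact perturbation operators.
    """
    g = {"nodes": {k: dict(v, attributes=list(v["attributes"]))
                   for k, v in graph["nodes"].items()},
         "edges": list(graph["edges"])}
    if strategy == "entity_swap":          # hallucinated entity
        g["nodes"][node_id]["label"] = "cat"
    elif strategy == "relation_swap":      # mis-grounded relation
        g["edges"] = [(s, "ignores", o) if s == node_id else (s, r, o)
                      for s, r, o in g["edges"]]
    elif strategy == "attribute_swap":     # over-specified / wrong attribute
        g["nodes"][node_id]["attributes"] = ["blue"]
    elif strategy == "node_drop":          # skipped reasoning step
        g["nodes"].pop(node_id)
        g["edges"] = [(s, r, o) for s, r, o in g["edges"]
                      if node_id not in (s, o)]
    return g

def verbalize(graph):
    """Turn the (possibly perturbed) graph into a rationale-like sentence."""
    facts = [f"the {graph['nodes'][s]['label']} {r} the {graph['nodes'][o]['label']}"
             for s, r, o in graph["edges"]
             if s in graph["nodes"] and o in graph["nodes"]]
    return "; ".join(facts) + "."

# Chosen rationale is grounded in the true graph; the rejected rationale is
# grounded in a perturbed graph but remains linguistically plausible.
chosen = verbalize(scene_graph)
rejected = verbalize(perturb_node(scene_graph, "n2", "entity_swap"))
dpo_pair = {"prompt": "What is the dog doing?",
            "chosen": chosen, "rejected": rejected}
print(dpo_pair)
```

The intended effect, under these assumptions, is that the rejected rationale differs from the chosen one only at the perturbed node, so the preference-optimization pressure falls on visual grounding rather than on surface fluency.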