Knowledge-intensive Visual Question Answering (KI-VQA) frequently suffers from severe knowledge conflicts caused by the inherent limitations of open-domain retrieval. Existing paradigms handle such conflicting evidence poorly because they lack both generalizable conflict detection and intra-model constraint mechanisms. To address these challenges, we propose REAL (Reasoning-Pivot Alignment), a framework centered on the novel concept of the reasoning-pivot. Unlike reasoning steps, which prioritize internal self-derivation, a reasoning-pivot is an atomic unit (a node or an edge) of the reasoning chain that emphasizes knowledge linkage and typically relies on external evidence to complete the reasoning. Supported by the REAL-VQA dataset that we construct, our approach combines Reasoning-Pivot Aware SFT (RPA-SFT), which trains a generalizable discriminator by aligning conflicts with pivot extraction, and Reasoning-Pivot Guided Decoding (RPGD), an intra-model decoding strategy that leverages these pivots for targeted conflict mitigation. Extensive experiments across diverse benchmarks demonstrate that REAL significantly improves discrimination accuracy and achieves state-of-the-art performance, validating the effectiveness of our pivot-driven resolution paradigm.