Hallucination remains a fundamental challenge for Multimodal Large Language Models (MLLMs). While Direct Preference Optimization (DPO) is a key alignment framework, existing approaches often rely heavily on costly external evaluators for scoring or rewriting, which introduces off-policy learnability gaps and discretization loss. Because such feedback has no access to the model's internal states, it overlooks the fine-grained conflicts between modalities that give rise to hallucinations during generation. To address this issue, we propose IRIS (Implicit Reward-Guided Internal Sifting), which leverages continuous implicit rewards in the native log-probability space to preserve full information density and capture internal modal competition. This on-policy paradigm eliminates learnability gaps by training on self-generated preference pairs. By sifting these pairs according to multimodal implicit rewards, IRIS ensures that optimization is driven by signals that directly resolve modal conflicts. Extensive experiments demonstrate that IRIS achieves highly competitive performance on key hallucination benchmarks using only 5.7k samples, without requiring any external feedback during preference alignment. These results confirm that IRIS offers an efficient and principled paradigm for mitigating MLLM hallucinations.
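As a point of reference, the implicit reward alluded to above can be read as the standard DPO implicit reward evaluated in the model's own log-probability space; the image-conditioned margin below is a hedged sketch of one way such a signal could expose modal competition, not necessarily IRIS's exact scoring rule:

\[
\hat{r}_\theta(x, v, y) = \beta \log \frac{\pi_\theta(y \mid x, v)}{\pi_{\mathrm{ref}}(y \mid x, v)},
\qquad
\Delta_{\mathrm{vis}}(y) = \hat{r}_\theta(x, v, y) - \hat{r}_\theta(x, \varnothing, y),
\]

where $x$ is the text prompt, $v$ the image, $y$ a self-generated response, $\pi_{\mathrm{ref}}$ the frozen reference policy, and $\beta$ the DPO temperature. Under this reading, a self-generated pair $(y^{+}, y^{-})$ would be retained for preference optimization only when its implicit-reward (or visual-margin) gap cleanly separates the two responses, so that gradient updates target modal conflicts rather than noise.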