Large Vision-Language Models (LVLMs) still struggle with visual hallucination, where generated responses are inconsistent with the visual input. Existing methods either rely on large-scale annotated data for fine-tuning, which incurs substantial computational overhead, or employ static post-hoc strategies that overlook the dynamic way hallucinations emerge. To address these limitations, we introduce a self-rewarding framework that enables dynamic hallucination mitigation at inference time without external supervision. Empirically, we find that visual hallucination follows phase-wise dynamic patterns, peaking at the onset of each semantic phase. Building on this insight, we propose \textbf{PSRD} (\textbf{P}hase-wise \textbf{S}elf-\textbf{R}eward \textbf{D}ecoding), which performs online hallucination correction guided by phase-wise self-reward signals. To reduce the cost of repeated self-evaluation during decoding, we distill the hallucination guidance signal from LVLMs into a lightweight reward model, which then provides on-the-fly guidance for targeted intervention during decoding, enabling precise hallucination suppression. PSRD reduces the hallucination rate of LLaVA-1.5-7B by 50.0% and consistently outperforms existing post-hoc methods on five hallucination evaluation benchmarks across four LVLMs. Further analysis confirms that PSRD effectively mitigates hallucination propagation and offers a highly controllable trade-off between performance and inference efficiency.
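To make the decoding idea concrete, here is a minimal, hypothetical sketch of generic phase-wise reward-guided decoding. It is not the authors' PSRD implementation. It assumes that each semantic phase is one sentence, that a `propose` callable stands in for sampling k candidate continuations from an LVLM, and that a `reward` callable stands in for the distilled lightweight reward model. At every phase onset, where the abstract reports that hallucination peaks, the sketch keeps the highest-scoring candidate. The toy proposer, the toy reward (the fraction of mentioned objects that actually appear in the image), and all object lists are invented for illustration.

```python
from typing import Callable, List, Set

# Hypothetical stand-ins: `Proposer` mimics sampling k candidate phases from an
# LVLM; `RewardModel` mimics a distilled hallucination scorer (higher = better).
Proposer = Callable[[List[str], int], List[str]]
RewardModel = Callable[[List[str], str], float]


def phase_wise_reward_decode(
    propose: Proposer, reward: RewardModel, num_phases: int, k: int = 4
) -> List[str]:
    """Build a response phase by phase, selecting each phase by reward score."""
    response: List[str] = []
    for _ in range(num_phases):
        # Phase onset: propose up to k candidates for the next phase.
        candidates = propose(response, k)
        if not candidates:
            break
        # max() keeps the first candidate on ties, so selection is deterministic.
        best = max(candidates, key=lambda c: reward(response, c))
        response.append(best)
    return response


# --- Toy setup (all invented for illustration) ---
IMAGE_OBJECTS: Set[str] = {"dog", "frisbee", "grass"}
KNOWN_OBJECTS: Set[str] = {"dog", "frisbee", "grass", "cat", "ball", "car"}

PHASE_CANDIDATES: List[List[str]] = [
    ["A cat sits on a car.", "A dog leaps on the grass."],
    ["It chases a ball.", "It catches a frisbee."],
]


def toy_propose(prefix: List[str], k: int) -> List[str]:
    """Return up to k fixed candidates for the phase after `prefix`."""
    return PHASE_CANDIDATES[len(prefix)][:k]


def toy_reward(prefix: List[str], candidate: str) -> float:
    """Fraction of mentioned objects present in the image (0.5 if none mentioned)."""
    words = {w.strip(".,").lower() for w in candidate.split()}
    mentioned = words & KNOWN_OBJECTS
    if not mentioned:
        return 0.5
    return len(mentioned & IMAGE_OBJECTS) / len(mentioned)


if __name__ == "__main__":
    print(phase_wise_reward_decode(toy_propose, toy_reward, num_phases=2, k=2))
    # k=1 degenerates to taking the first sample at each phase (no intervention).
    print(phase_wise_reward_decode(toy_propose, toy_reward, num_phases=2, k=1))
```

In this toy, `k` acts as a simple cost knob: each phase onset costs k proposals plus k reward calls, and `k=1` removes the intervention entirely. This only loosely mirrors the performance/efficiency trade-off the abstract mentions. The real method's scheduling and intervention details are not specified here.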