Large vision-language models (LVLMs) demonstrate remarkable capabilities in multimodal tasks but are prone to misinterpreting visual inputs, often resulting in hallucinations and unreliable outputs. To address these challenges, we propose Dropout Decoding, a novel inference-time approach that quantifies the uncertainty of visual tokens and selectively masks uncertain tokens to improve decoding. Our method measures the uncertainty of each visual token by projecting it onto the text space and decomposing it into aleatoric and epistemic components. Specifically, we focus on epistemic uncertainty, which captures perception-related errors more effectively. Inspired by dropout regularization, we introduce uncertainty-guided token dropout, which applies the dropout principle to input visual tokens instead of model parameters, and during inference rather than training. By aggregating predictions from an ensemble of masked decoding contexts, Dropout Decoding robustly mitigates errors arising from visual token misinterpretations. Evaluations on benchmarks including CHAIR, THRONE, and MMBench demonstrate that Dropout Decoding significantly reduces object hallucinations (OH) and enhances both reliability and quality of LVLM outputs across diverse visual contexts.
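The mechanism above can be sketched in simplified form. The snippet below is a minimal illustration, not the paper's implementation: `epistemic_uncertainty` uses a standard mutual-information-style decomposition (total predictive entropy minus expected entropy) as a stand-in for the paper's text-space projection, and `dropout_decoding_step`, `predict`, `n_masks`, and `drop_frac` are hypothetical names and parameters introduced here for illustration.

```python
import math
import random

def epistemic_uncertainty(token_probs):
    """Decompose uncertainty for one visual token given several predictive
    distributions (e.g. from different projections into the text space).
    Returns the epistemic component: total entropy minus expected entropy."""
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0)
    vocab = len(token_probs[0])
    mean = [sum(p[i] for p in token_probs) / len(token_probs) for i in range(vocab)]
    total = entropy(mean)                                      # total predictive uncertainty
    expected = sum(entropy(p) for p in token_probs) / len(token_probs)  # aleatoric part
    return total - expected                                    # epistemic part

def dropout_decoding_step(visual_tokens, uncertainties, predict,
                          n_masks=5, drop_frac=0.3, seed=0):
    """Uncertainty-guided token dropout at inference time: build an ensemble
    of masked visual contexts (dropping uncertain tokens more often) and
    average the resulting logits."""
    rng = random.Random(seed)
    n = len(visual_tokens)
    k = max(1, int(drop_frac * n))  # how many visual tokens to drop per mask
    agg = None
    for _ in range(n_masks):
        # Sample k distinct tokens to drop, weighted by their uncertainty,
        # so uncertain tokens are masked more frequently across the ensemble.
        drop = set()
        while len(drop) < k:
            i = rng.choices(range(n), weights=uncertainties)[0]
            drop.add(i)
        masked = [t for j, t in enumerate(visual_tokens) if j not in drop]
        logits = predict(masked)  # one decoding pass on the masked context
        agg = logits if agg is None else [a + l for a, l in zip(agg, logits)]
    return [a / n_masks for a in agg]  # ensemble-averaged logits
```

As a toy usage, `dropout_decoding_step(tokens, scores, model_forward)` would be called once per decoding step, with `model_forward` standing in for the LVLM's forward pass over the remaining visual tokens; in the actual method the aggregation operates on the model's output distributions rather than on a toy list of numbers.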