Direct Preference Optimization (DPO) has proven highly effective at mitigating hallucinations in Large Vision Language Models (LVLMs) by aligning their outputs more closely with human preferences. Despite recent progress, existing methods suffer from two drawbacks: 1) lack of scalable token-level rewards; and 2) neglect of visual-anchored tokens. To this end, we propose a novel Token Preference Optimization model with self-calibrated rewards (dubbed TPO), which adaptively attends to visually correlated tokens without requiring fine-grained annotations. Specifically, we introduce a token-level \emph{visual-anchored reward}, defined as the difference between the logit distributions of generated tokens conditioned on the raw image and on a corrupted one. In addition, to highlight informative visual-anchored tokens, we propose a visual-aware training objective that enables more accurate token-level optimization. Extensive experiments demonstrate the state-of-the-art performance of the proposed TPO. For example, built on top of LLAVA-1.5-7B, our TPO achieves significant absolute improvements on hallucination benchmarks.
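To make the reward definition concrete, the following is a minimal sketch of one plausible reading of the abstract: the per-token visual-anchored reward is taken as the log-probability gap of each generated token when the model is conditioned on the raw image versus a corrupted image. The function name, the two-forward-pass setup, and the random tensors standing in for real LVLM logits are all illustrative assumptions, not the paper's actual implementation.

\begin{verbatim}
import torch
import torch.nn.functional as F

def visual_anchored_rewards(logits_raw, logits_corrupt, token_ids):
    # Hypothetical formulation: reward each token by how much its
    # log-probability drops when the image is corrupted. Tokens whose
    # probability depends strongly on the image are visually anchored.
    logp_raw = F.log_softmax(logits_raw, dim=-1)
    logp_cor = F.log_softmax(logits_corrupt, dim=-1)
    # Gather the log-prob of each actually generated token.
    lp_r = logp_raw.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    lp_c = logp_cor.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    return lp_r - lp_c

# Toy usage: random logits stand in for two forward passes of an LVLM,
# one conditioned on the raw image, one on (e.g.) a noise-corrupted image.
T, V = 8, 32000  # sequence length, vocabulary size (assumed values)
logits_raw = torch.randn(T, V)
logits_corrupt = torch.randn(T, V)
token_ids = torch.randint(0, V, (T,))
print(visual_anchored_rewards(logits_raw, logits_corrupt, token_ids))
\end{verbatim}

Because the gap is computed directly from the model's own two conditional distributions, it requires no human token-level labels, which is consistent with the abstract's claim that TPO is annotation-free and self-calibrated.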