Large Vision-Language Models (LVLMs) have advanced remarkably alongside the recent surge of large language models. Despite these advancements, LVLMs tend to generate plausible yet inaccurate information, or information inconsistent with the provided source content. This phenomenon, also known as ``hallucination'', can have serious downstream implications when LVLMs are deployed. To address this, we present VORD, a simple and effective method that alleviates hallucinations by calibrating token predictions based on ordinal relationships between modified image pairs. VORD is presented in two forms: 1.) a minimalist, training-free variant that eliminates implausible tokens from modified image pairs, and 2.) a trainable objective function that penalizes unlikely tokens. Our experiments demonstrate that VORD delivers better calibration and effectively mitigates object hallucinations across a wide range of LVLM benchmarks.
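The abstract does not spell out the training-free variant, but as an illustration only, one plausible reading is: when an image is deliberately degraded, a token's predicted probability should not increase relative to the original image, and tokens violating this ordering can be masked out before decoding. The sketch below is a hypothetical interpretation under that assumption; the function name `vord_filter` and the `margin` parameter are inventions for illustration, not the paper's actual API.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def vord_filter(logits_orig, logits_mod, margin=0.0):
    """Hypothetical sketch of ordinal token filtering (assumption, not
    the paper's exact rule): keep a token only if its probability under
    the original image is at least as high as under the modified image
    (plus an optional margin); violating tokens are set to -inf so they
    cannot be sampled during decoding."""
    p_orig = softmax(logits_orig)
    p_mod = softmax(logits_mod)
    keep = p_orig >= p_mod + margin
    return np.where(keep, logits_orig, -np.inf)
```

In this reading, a token whose probability rises when the image is corrupted is treated as driven by language priors rather than visual evidence, and is therefore pruned from the candidate set.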