While recent years have seen rapid progress in image-conditioned text generation, image captioning still suffers from the fundamental issue of hallucinations, the generation of spurious details that cannot be inferred from the given image. Dedicated methods for reducing hallucinations in image captioning largely focus on closed-vocabulary object tokens, ignoring most types of hallucinations that occur in practice. In this work, we propose MOCHa, an approach that harnesses advancements in reinforcement learning (RL) to address the sequence-level nature of hallucinations in an open-world setup. To optimize for caption fidelity to the input image, we leverage ground-truth reference captions as proxies to measure the logical consistency of generated captions. However, optimizing for caption fidelity alone fails to preserve the semantic adequacy of generations; therefore, we propose a multi-objective reward function that jointly targets these qualities, without requiring any strong supervision. We demonstrate that these goals can be simultaneously optimized with our framework, enhancing performance for various captioning models of different scales. Our qualitative and quantitative results demonstrate MOCHa's superior performance across various established metrics. We also demonstrate the benefit of our method in the open-vocabulary setting. To this end, we contribute OpenCHAIR, a new benchmark for quantifying open-vocabulary hallucinations in image captioning models, constructed using generative foundation models. We will release our code, benchmark, and trained models.
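As a rough illustration of why a multi-objective reward is needed, the sketch below combines a fidelity score and an adequacy score into a single scalar, both computed against reference captions. Everything here is an assumption made for illustration: the names `toy_fidelity`, `toy_adequacy`, `combined_reward`, the weight `alpha`, and the token-overlap scorers are hypothetical stand-ins, not the reward terms used by MOCHa, and the RL optimization step itself is omitted.

```python
from typing import Callable, Sequence

# A scorer maps (generated caption, reference captions) to a value in [0, 1].
Scorer = Callable[[str, Sequence[str]], float]


def _tokens(text: str) -> set:
    """Lowercased whitespace tokens; a deliberately crude tokenizer."""
    return set(text.lower().split())


def toy_fidelity(caption: str, references: Sequence[str]) -> float:
    """Hypothetical stand-in: share of caption tokens supported by any reference.

    Precision-like: extra, unsupported details lower the score.
    """
    cap = _tokens(caption)
    if not cap:
        return 0.0
    ref = set().union(*(_tokens(r) for r in references))
    return len(cap & ref) / len(cap)


def toy_adequacy(caption: str, references: Sequence[str]) -> float:
    """Hypothetical stand-in: share of reference tokens covered by the caption.

    Recall-like: overly short, generic captions lower the score.
    """
    ref = set().union(*(_tokens(r) for r in references))
    if not ref:
        return 0.0
    return len(_tokens(caption) & ref) / len(ref)


def combined_reward(
    caption: str,
    references: Sequence[str],
    fidelity: Scorer = toy_fidelity,
    adequacy: Scorer = toy_adequacy,
    alpha: float = 0.5,  # assumed trade-off weight, not a value from the paper
) -> float:
    """Weighted sum of the two objectives, usable as a sequence-level reward."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * fidelity(caption, references) + (1.0 - alpha) * adequacy(
        caption, references
    )


if __name__ == "__main__":
    refs = ["a dog runs on grass"]
    for cap in ["a dog", "a dog runs on grass with a red ball"]:
        print(cap, "|", toy_fidelity(cap, refs), combined_reward(cap, refs))
```

With these toy scorers, the two-word caption gets a perfect fidelity score while covering little of the reference, so a fidelity-only reward would favor it. The combined reward instead ranks the longer caption higher even though it contains unsupported details, which shows the trade-off that the weighting has to balance.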