While recent years have seen rapid progress in image-conditioned text generation, image captioning still suffers from the fundamental issue of hallucinations, namely, the generation of spurious details that cannot be inferred from the given image. Existing methods largely use closed-vocabulary object lists to mitigate or evaluate hallucinations in image captioning, ignoring the long-tailed nature of hallucinations that occur in practice. To this end, we propose a framework for addressing hallucinations in image captioning in the open-vocabulary setting. Our framework includes a new benchmark, OpenCHAIR, that leverages generative foundation models to evaluate open-vocabulary object hallucinations in image captioning, surpassing the popular and similarly-sized CHAIR benchmark in both diversity and accuracy. Furthermore, to mitigate open-vocabulary hallucinations without relying on a closed object list, we propose MOCHa, an approach that harnesses recent advances in reinforcement learning. Our multi-objective reward function explicitly targets the trade-off between fidelity and adequacy in generated captions without requiring any strong supervision. MOCHa improves a large variety of image captioning models, as captured by our OpenCHAIR benchmark and other existing metrics. We will release our code and models.
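The fidelity–adequacy trade-off mentioned above can be illustrated as a weighted combination of two reward terms, as is common in multi-objective reinforcement learning. The function names, weights, and example scores below are illustrative assumptions for exposition, not the paper's actual reward implementation.

```python
# Hypothetical sketch of a multi-objective caption reward
# (illustrative only; not MOCHa's actual implementation).

def combined_reward(fidelity_score: float,
                    adequacy_score: float,
                    w_fidelity: float = 0.5,
                    w_adequacy: float = 0.5) -> float:
    """Linearly combine a fidelity term (penalizing hallucinated
    details) and an adequacy term (rewarding coverage of the image
    content) into a single scalar reward for RL fine-tuning."""
    return w_fidelity * fidelity_score + w_adequacy * adequacy_score

# Example: a caption that is faithful but incomplete
# scores high on fidelity and low on adequacy.
print(combined_reward(0.9, 0.4))  # 0.65
```

Adjusting the weights shifts the optimization pressure between avoiding spurious details and describing the image thoroughly, which is the trade-off the multi-objective reward is designed to balance.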