In recent years, synthetic visual instructions produced by generative language models have shown plausible text generation performance on visual question-answering tasks. However, hallucination in generative language models remains a challenge: the generated image-text data contains unintended content. This paper presents a novel and scalable method for generating visually dehallucinative instructions, dubbed CAP2QA, that constrains the scope of generation to image contents only. Our key contributions are the image-aligned instructive QA dataset CAP2QA-COCO and its scalable recipe. In our experiments, we compare synthetic visual instruction datasets that share the same source data by performing visual instruction tuning on each and evaluating on general visual recognition tasks. The results show that our proposed method significantly reduces visual hallucination while consistently improving visual recognition ability and expressiveness.