This paper introduces a novel and significant challenge for Vision Language Models (VLMs), termed Unsolvable Problem Detection (UPD). UPD examines a VLM's ability to withhold answers when faced with unsolvable problems in Visual Question Answering (VQA) tasks. UPD encompasses three distinct settings: Absent Answer Detection (AAD), Incompatible Answer Set Detection (IASD), and Incompatible Visual Question Detection (IVQD). To investigate the UPD problem in depth, we conduct extensive experiments, which indicate that most VLMs, including GPT-4V and LLaVA-Next-34B, struggle with our benchmarks to varying extents, highlighting significant room for improvement. To address UPD, we explore both training-free and training-based solutions, offering new insights into their effectiveness and limitations. We hope our insights, together with future efforts within the proposed UPD settings, will enhance the broader understanding and development of more practical and reliable VLMs.