In recent years, Visual Question Answering (VQA) has made significant strides, particularly with the advent of multimodal models that integrate vision and language understanding. However, existing VQA datasets often overlook the complexities introduced by image illusions, which pose unique challenges for both human perception and model interpretation. In this study, we introduce a novel task, Illusory VQA, along with four specialized datasets: IllusionMNIST, IllusionFashionMNIST, IllusionAnimals, and IllusionChar. These datasets are designed to evaluate how well state-of-the-art multimodal models recognize and interpret visual illusions. We assess the zero-shot performance of various models, fine-tune selected models on our datasets, and propose a simple yet effective illusion-detection method based on Gaussian and blur low-pass filters. We show that this method significantly improves model performance and that, in the case of BLIP-2 on IllusionAnimals, it surpasses human performance without any fine-tuning. Our findings highlight the disparity between human and model perception of illusions and demonstrate that fine-tuning and targeted preprocessing can substantially enhance model robustness. This work contributes to the development of more human-like visual understanding in multimodal models and points to future directions such as making the filters' parameters learnable.
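For concreteness, the following is a minimal sketch of the kind of low-pass preprocessing the abstract describes, using Pillow's Gaussian and box-blur filters; the blur radii, function name, and file paths are illustrative assumptions, not the paper's exact settings.

```python
from PIL import Image, ImageFilter


def low_pass(image: Image.Image, gaussian_radius: float = 2.0,
             box_radius: float = 0.0) -> Image.Image:
    """Apply Gaussian (and optionally box-blur) low-pass filtering.

    Suppressing high-frequency detail tends to make the hidden figure
    in an illusion image easier for a vision-language model to read.
    Radii are illustrative; the paper's settings may differ.
    """
    out = image.convert("RGB").filter(
        ImageFilter.GaussianBlur(radius=gaussian_radius))
    if box_radius > 0:
        out = out.filter(ImageFilter.BoxBlur(radius=box_radius))
    return out


# Usage: filter the image before passing it to a multimodal model.
img = Image.open("illusion_example.png")  # hypothetical input path
low_pass(img).save("illusion_filtered.png")
```

The design intuition is that illusions often encode the hidden content at lower spatial frequencies than the distracting texture, so a simple low-pass filter can expose it without any model changes.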