In recent years, Visual Question Answering (VQA) has made significant strides, particularly with the advent of multimodal models that integrate vision and language understanding. However, existing VQA datasets often overlook the complexities introduced by image illusions, which pose unique challenges for both human perception and model interpretation. In this study, we introduce a novel task, Illusory VQA, along with four specialized datasets: IllusionMNIST, IllusionFashionMNIST, IllusionAnimals, and IllusionChar. These datasets are designed to evaluate how well state-of-the-art multimodal models recognize and interpret visual illusions. We assess the zero-shot performance of various models, fine-tune selected models on our datasets, and propose a simple yet effective illusion-detection method based on Gaussian and blur low-pass filters. We show that this method significantly improves model performance and that, in the case of BLIP-2 on IllusionAnimals, it surpasses human performance without any fine-tuning. Our findings highlight the disparity between human and model perception of illusions and demonstrate that fine-tuning and targeted preprocessing can substantially enhance model robustness. This work contributes to the development of more human-like visual understanding in multimodal models and points to future directions such as making the filters' parameters learnable.
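For concreteness, the following is a minimal sketch of the kind of low-pass preprocessing the abstract describes, using Pillow's Gaussian and box-blur filters; the blur radii, function name, and file paths are illustrative assumptions, not the paper's exact settings.

```python
from PIL import Image, ImageFilter


def low_pass(image: Image.Image, gaussian_radius: float = 2.0,
             box_radius: float = 0.0) -> Image.Image:
    """Apply Gaussian (and optionally box-blur) low-pass filtering.

    Suppressing high-frequency detail tends to make the hidden figure
    in an illusion image easier for a vision-language model to read.
    Radii are illustrative; the paper's settings may differ.
    """
    out = image.convert("RGB").filter(
        ImageFilter.GaussianBlur(radius=gaussian_radius))
    if box_radius > 0:
        out = out.filter(ImageFilter.BoxBlur(radius=box_radius))
    return out


# Usage: filter the image before passing it to a multimodal model.
img = Image.open("illusion_example.png")  # hypothetical input path
low_pass(img).save("illusion_filtered.png")
```

The design intuition is that illusions often encode the hidden content at lower spatial frequencies than the distracting texture, so a simple low-pass filter can expose it without any model changes.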