Visual reasoning requires integrating multimodal perception with commonsense and external knowledge of the world. In recent years, numerous large vision-language models (LVLMs) have been proposed, demonstrating strong commonsense reasoning across diverse domains and tasks. Nevertheless, training such LVLMs is computationally expensive. Rather than training LVLMs from scratch on large datasets, recent approaches explore ways to exploit the capabilities of multiple existing LVLMs, for example through ensemble methods. In this work, we propose self-ensemble, a novel training-free method that improves a model's generalization and visual reasoning without updating any parameters. Our key insight is that an LVLM can ensemble with itself, without requiring any additional LVLMs, which helps unlock its internal capabilities. Extensive experiments on various benchmarks demonstrate the effectiveness of our method, achieving state-of-the-art (SOTA) performance on SketchyVQA, Outside Knowledge VQA, and out-of-distribution VQA tasks.
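The abstract does not spell out the self-ensembling mechanism. As a rough illustration only, the sketch below shows one common training-free way a single model can ensemble with itself: sample several stochastic decodes of the same query and aggregate them by majority vote. The `query_lvlm` callable, its `temperature` parameter, and the voting scheme are hypothetical placeholders for this sketch, not the paper's actual method.

```python
from collections import Counter
from typing import Callable

def self_ensemble_answer(
    query_lvlm: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> answer
    prompt: str,
    n_samples: int = 5,
    temperature: float = 0.7,
) -> str:
    """Aggregate multiple stochastic decodes of ONE model by majority vote.

    A generic illustration of training-free self-ensembling; the method
    proposed in the paper may aggregate differently.
    """
    # Query the same model several times with sampling enabled,
    # so each decode can differ.
    answers = [query_lvlm(prompt, temperature) for _ in range(n_samples)]
    # Normalize lightly so trivially different strings still match.
    normalized = [a.strip().lower() for a in answers]
    # Return the most frequent answer across the samples.
    winner, _ = Counter(normalized).most_common(1)[0]
    return winner
```

Majority voting is only one possible aggregation choice; averaging token-level scores across prompt variants of the same model would be another way to realize a self-ensemble without touching any parameters.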