Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks. Despite remarkable progress, VLMs are subject to several shortcomings due to the limited capabilities of vision encoders, e.g. "blindness" to certain image features, visual hallucination, etc. To address these issues, we study broadening the visual encoding capabilities of VLMs. We first comprehensively benchmark several vision encoders with different inductive biases for solving VLM tasks. We observe that there is no single encoding configuration that consistently achieves top performance across different tasks, and that encoders with different biases can perform surprisingly similarly. Motivated by this, we introduce a method, named BRAVE, that consolidates features from multiple frozen encoders into a more versatile representation that can be directly fed as input to a frozen LM. BRAVE achieves state-of-the-art performance on a broad range of captioning and VQA benchmarks and significantly reduces the aforementioned issues of VLMs, while requiring fewer trainable parameters than existing methods and using a more compressed representation. Our results highlight the potential of incorporating different visual biases for a broader and more contextualized visual understanding in VLMs.
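To make the consolidation idea concrete, below is a minimal, hypothetical sketch of fusing token features from several frozen vision encoders into a short trainable prefix for a frozen LM. It is illustrative only and not the actual BRAVE architecture; the class name `MultiEncoderBridge`, the learned-query cross-attention design, and all dimensions are assumptions for the example.

```python
import torch
import torch.nn as nn


class MultiEncoderBridge(nn.Module):
    """Sketch: fuse token features from several frozen vision encoders into a
    fixed-length prefix for a frozen language model (illustrative, not BRAVE)."""

    def __init__(self, encoder_dims, lm_dim, num_queries=32):
        super().__init__()
        # Trainable projections map each encoder's feature width to the LM width.
        self.projs = nn.ModuleList([nn.Linear(d, lm_dim) for d in encoder_dims])
        # Learned queries cross-attend over the concatenated visual tokens,
        # compressing them into a short sequence the frozen LM can consume.
        self.queries = nn.Parameter(torch.randn(num_queries, lm_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(lm_dim, num_heads=8, batch_first=True)

    def forward(self, encoder_features):
        # encoder_features: list of (batch, tokens_i, dim_i) from frozen encoders.
        fused = torch.cat(
            [proj(f) for proj, f in zip(self.projs, encoder_features)], dim=1
        )  # (batch, sum(tokens_i), lm_dim)
        q = self.queries.unsqueeze(0).expand(fused.size(0), -1, -1)
        prefix, _ = self.cross_attn(q, fused, fused)
        return prefix  # (batch, num_queries, lm_dim), prepended to the LM input embeddings


# Hypothetical usage: two frozen encoders emitting 768- and 1024-dim tokens,
# a frozen LM with 2048-dim embeddings; only the bridge is trained.
bridge = MultiEncoderBridge(encoder_dims=[768, 1024], lm_dim=2048)
feats = [torch.randn(2, 196, 768), torch.randn(2, 256, 1024)]
prefix_tokens = bridge(feats)
```

Keeping both the encoders and the LM frozen and training only the bridge is what keeps the trainable-parameter count small; compressing the concatenated tokens to a fixed number of queries is one way to obtain the more compact visual representation the abstract refers to.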