When we look around and perform complex tasks, how we see and selectively process what we see is crucial. However, current multimodal LLMs (MLLMs) lack this visual search mechanism, which hinders their ability to focus on important visual details, especially when handling high-resolution and visually crowded images. To address this, we introduce V*, an LLM-guided visual search mechanism that employs the world knowledge in LLMs for efficient visual querying. When combined with an MLLM, this mechanism enhances collaborative reasoning, contextual understanding, and precise targeting of specific visual elements. This integration results in a new MLLM meta-architecture, named Show, sEArch, and TelL (SEAL). We further create V*Bench, a benchmark specifically designed to evaluate MLLMs in their ability to process high-resolution images and focus on visual details. Our study highlights the necessity of incorporating visual search capabilities into multimodal systems. The code is available at https://github.com/penghao-wu/vstar.
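To make the described meta-architecture concrete, below is a minimal conceptual sketch of a SEAL-style loop: the MLLM first tries to answer directly; if a needed visual detail cannot be grounded, an LLM-guided search step localizes the missing target and the resulting crop is fed back as additional visual context before the final answer. All class and function names here (`mllm_try_answer`, `llm_guided_search`, `mllm_answer_with_crops`) are hypothetical stubs for illustration, not the actual SEAL or V* API.

```python
# Hedged, self-contained sketch of an LLM-guided visual search loop (SEAL-style).
# Every interface below is a stub invented for illustration.

from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SearchResult:
    label: str
    box: Tuple[int, int, int, int]  # (x, y, w, h) of the located target crop


def mllm_try_answer(image, question) -> Tuple[Optional[str], List[str]]:
    """Stub MLLM call: returns (answer, missing_targets).
    If the model cannot ground the question in the image, answer is None and
    missing_targets lists the object names it still needs to see."""
    return None, ["red backpack"]  # pretend the detail is invisible at low resolution


def llm_guided_search(image, target: str) -> SearchResult:
    """Stub V*-style search: an LLM proposes likely regions from world knowledge
    (e.g. 'a backpack is usually on the ground or on a person') and the image is
    explored coarse-to-fine until the target is localized."""
    return SearchResult(label=target, box=(512, 384, 64, 64))


def mllm_answer_with_crops(image, question, crops: List[SearchResult]) -> str:
    """Stub final answer conditioned on the question plus the searched crops."""
    located = ", ".join(f"{c.label}@{c.box}" for c in crops)
    return f"answer grounded on located targets: {located}"


def seal_answer(image, question) -> str:
    """Show: attempt a direct answer. sEArch: locate missing details. Tell: answer."""
    answer, missing = mllm_try_answer(image, question)
    if answer is not None:
        return answer
    crops = [llm_guided_search(image, target) for target in missing]
    return mllm_answer_with_crops(image, question, crops)


if __name__ == "__main__":
    print(seal_answer(image="high_res_scene.jpg", question="What color is the backpack?"))
```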