BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

LLMs have demonstrated remarkable abilities at interacting with humans through language, especially when trained on instruction-following data. Recent multi-modal LLMs, such as MiniGPT-4, LLaVA, and X-LLM, extend these abilities by incorporating multi-modal inputs, including image, video, and speech. Although these models are effective at producing precise and detailed language descriptions of a given modality signal, they give up the ability to ground specific parts of the input and therefore construct only a coarse-grained mapping. However, an explicit and informative correspondence between text and other modalities would not only improve the user experience but also help to expand the application scenarios of multi-modal LLMs. Therefore, we propose BuboGPT, a multi-modal LLM with visual grounding that can perform cross-modal interaction between vision, audio, and language, providing fine-grained understanding of visual objects and other given modalities. As a result, BuboGPT is able to point out the specific location of an object in the image when it generates a response or description for that object. Our contributions are two-fold: 1) An off-the-shelf visual grounding module based on SAM that extracts entities in a sentence and finds the corresponding masks in the image. 2) A two-stage training scheme and instruction dataset to endow joint text-image-audio understanding. Our experiments show that BuboGPT achieves impressive multi-modal understanding and visual grounding abilities during interaction with humans. It performs consistently well when provided with arbitrary modality combinations (either aligned or unaligned). Our code, model and dataset are available at https://bubo-gpt.github.io .
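The grounding step in the first contribution (extract the entities in a sentence, then find the matching masks in the image) can be illustrated with a toy sketch. This is not the BuboGPT implementation: the abstract does not describe the module's internals, so the sketch assumes that image regions already carry text labels and substitutes simple word matching for entity extraction and bounding boxes for SAM masks. The names `Region`, `extract_entities`, and `ground` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A labeled image region (hypothetical stand-in for a segmentation mask)."""
    label: str   # text tag assumed to be attached to the region
    box: tuple   # (x0, y0, x1, y1), used here in place of a pixel mask


def extract_entities(sentence, vocabulary):
    """Return vocabulary terms mentioned in the sentence, in order of first appearance.

    Assumption: plain word matching replaces a real entity extractor.
    """
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    entities = []
    for word in words:
        if word in vocabulary and word not in entities:
            entities.append(word)
    return entities


def ground(sentence, regions):
    """Map each entity mentioned in the sentence to the boxes of matching regions."""
    vocabulary = {region.label for region in regions}
    entities = extract_entities(sentence, vocabulary)
    return {
        entity: [region.box for region in regions if region.label == entity]
        for entity in entities
    }


# Illustrative data only: three labeled regions and one generated sentence.
regions = [
    Region("dog", (10, 20, 110, 140)),
    Region("ball", (120, 100, 150, 130)),
    Region("tree", (200, 0, 300, 220)),
]
result = ground("A dog is chasing a ball.", regions)
```

Here `result` links "dog" and "ball" to their regions and leaves out "tree", which the sentence never mentions. That is the kind of fine-grained text-to-region correspondence the abstract contrasts with a coarse-grained mapping.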
