Recently, Multimodal Large Language Models (MLLMs) have attracted significant research interest owing to their exceptional content-reasoning and instruction-following capabilities. To effectively instruct an MLLM, in addition to conventional language expressions, referring to objects by drawing directly on images has emerged as a prevalent tool (referred to as "referring visual prompts") due to its efficacy in aligning user intentions with specific image regions. To accommodate the most common referring visual prompts, namely points, boxes, and masks, existing approaches first employ specialized feature-encoding modules to capture the semantics of the regions highlighted by these prompts, and then adapt the encoded region features to MLLMs through fine-tuning on a meticulously curated multimodal instruction dataset. However, such designs suffer from architectural redundancy. Moreover, they struggle to generalize to the diverse, arbitrary referring visual prompts encountered in real-life scenarios. To address these issues, we propose EAGLE, a novel MLLM that enables comprehension of arbitrary referring visual prompts with less training effort than existing approaches. Specifically, EAGLE preserves the innate format of referring visual prompts, rendering them as colored patches directly on the given image for instruction tuning. Our approach embeds referring visual prompts as spatial concepts that convey specific spatial areas comprehensible to the MLLM, while the semantic understanding of these regions originates from the MLLM itself. In addition, we propose a Geometry-Agnostic Learning (GAL) paradigm to further disentangle the MLLM's region-level comprehension from the specific formats of referring visual prompts. Extensive experiments demonstrate the effectiveness of the proposed method.
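The core idea of keeping referring visual prompts in their innate format — a colored patch alpha-blended onto the input image rather than a separately encoded region feature — can be sketched as follows. This is an illustrative NumPy sketch, not the paper's actual rendering pipeline; the function name, the fixed alpha of 0.5, and the red highlight color are all assumptions for demonstration.

```python
import numpy as np

def render_visual_prompt(image, mask, color=(255, 0, 0), alpha=0.5):
    """Alpha-blend a colored patch over the masked region of an RGB image.

    image: (H, W, 3) uint8 array.
    mask:  (H, W) boolean array marking the referred region; a point, box,
           or free-form scribble can all be rasterized into such a mask,
           which is what makes the format geometry-agnostic.
    """
    out = image.astype(np.float32)
    color = np.asarray(color, dtype=np.float32)
    # Blend only the highlighted pixels; the rest of the image is untouched.
    out[mask] = (1.0 - alpha) * out[mask] + alpha * color
    return out.astype(np.uint8)

# Example: highlight a 20x20 box region on a uniform gray 64x64 image.
img = np.full((64, 64, 3), 128, dtype=np.uint8)
box = np.zeros((64, 64), dtype=bool)
box[10:30, 10:30] = True
highlighted = render_visual_prompt(img, box)
```

Because the prompt is rendered into pixel space, the same image encoder and MLLM consume it with no extra region-encoding module, which is the architectural simplification the abstract describes.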