We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception ability. Beyond holistic image understanding, Groma is adept at region-level tasks such as region captioning and visual grounding. These capabilities are built upon a localized visual tokenization mechanism, in which an input image is decomposed into regions of interest that are subsequently encoded into region tokens. By integrating region tokens into user instructions and model responses, we seamlessly enable Groma to understand user-specified region inputs and to ground its textual output to images. In addition, to enhance the grounded chat ability of Groma, we curate a visually grounded instruction dataset by leveraging GPT-4V and visual prompting techniques. Compared with MLLMs that rely on the language model or an external module for localization, Groma consistently demonstrates superior performance on standard referring and grounding benchmarks, highlighting the advantages of embedding localization into image tokenization. Project page: https://groma-mllm.github.io/.
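To make the idea of placing region tokens in instructions and responses concrete, the following is a minimal sketch and not Groma's actual implementation. The `<r0>`-style token format, the `Region` class, the helper names, and the example boxes are all hypothetical and chosen only for illustration. The sketch shows how a placeholder in a user instruction could be replaced by a region token, and how region tokens appearing in a response could be mapped back to image boxes.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A region of interest, stored as a normalized (x1, y1, x2, y2) box."""
    box: tuple


def region_token(index):
    # Hypothetical token format; the real model's vocabulary may differ.
    return f"<r{index}>"


def insert_region_tokens(instruction, regions):
    """Replace '{i}' placeholders in the instruction with region tokens."""
    tokens = [region_token(i) for i in range(len(regions))]
    return instruction.format(*tokens)


def ground_response(response, regions):
    """Map each region token found in the response back to its box."""
    grounded = []
    for match in re.finditer(r"<r(\d+)>", response):
        index = int(match.group(1))
        if index < len(regions):  # ignore tokens with no matching region
            grounded.append((match.group(0), regions[index].box))
    return grounded


# Assumed example data: two regions with made-up coordinates.
regions = [
    Region((0.10, 0.20, 0.45, 0.80)),
    Region((0.50, 0.25, 0.90, 0.75)),
]
prompt = insert_region_tokens("What is the object in {1} doing?", regions)
links = ground_response("The dog <r1> is chasing a ball <r0>.", regions)
```

In this sketch the text side only ever sees opaque tokens, while the lookup from token to box lives outside the language model, which mirrors the abstract's point that localization is handled at tokenization time rather than by the language model itself.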