An essential topic for multimodal large language models (MLLMs) is aligning vision and language concepts at a finer level. In particular, we focus on encoding visual referential information for tasks such as referring and grounding. Existing methods, including proxy encoding and geometry encoding, introduce additional syntax to encode an object's location, placing an extra burden on MLLMs when training them to communicate between language and vision. This study presents ClawMachine, a new methodology that notates an entity directly with its visual tokens, allowing us to unify the prompts and answers of visual referential tasks without additional syntax. Built upon a joint vision-language vocabulary, ClawMachine unifies visual referring and grounding into an auto-regressive format and learns with a decoder-only architecture. Experiments validate that our model achieves competitive performance on visual referring and grounding tasks while requiring less training data. ClawMachine also demonstrates a native ability to integrate multi-source information for complex visual reasoning, which prior MLLMs can hardly perform without task-specific adaptations.
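To make the unified format concrete, the sketch below illustrates (in plain Python) how a grounding query and its answer could live in one auto-regressive token stream over a joint vision-language vocabulary, with an entity notated by discrete visual tokens rather than coordinate syntax. This is a hypothetical illustration, not the authors' implementation: all token ids, vocabulary sizes, and the `visual_token` helper are assumptions.

```python
# Minimal sketch (hypothetical, not the authors' code) of notating an entity
# with its own visual tokens instead of coordinate syntax like <box>...</box>.
# Token ids and vocabulary sizes below are made up for illustration.

TEXT_VOCAB_SIZE = 32_000     # ordinary language tokens (assumed size)
VISUAL_VOCAB_SIZE = 8_192    # discrete codes from a visual tokenizer (assumed)

def visual_token(code: int) -> int:
    """Map a visual code into the joint vocabulary, offset past the text ids."""
    assert 0 <= code < VISUAL_VOCAB_SIZE
    return TEXT_VOCAB_SIZE + code

# Grounding: the prompt is plain text; the answer names the entity directly
# with the visual tokens of its image region -- no extra location syntax.
prompt = [101, 2054, 2003]                               # e.g. "where is the dog"
answer = [visual_token(c) for c in (17, 904, 311, 56)]   # the region's patch codes

# Referring is the same stream in reverse: visual tokens appear in the prompt
# and the model decodes a textual description auto-regressively.
sequence = prompt + answer   # one sequence, trained with next-token prediction
```

Because both task directions reduce to next-token prediction over the same vocabulary, a single decoder-only model can handle referring and grounding without task-specific heads or output parsers.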