In this work, we propose a training-free method that injects visual referring abilities into Multimodal Large Language Models (MLLMs) through learnable visual token optimization. We observe that the attention layers of MLLMs model the relationship between text prompt tokens and visual tokens. Our approach adjusts the visual tokens output by the MLP during inference, controlling which text prompt tokens attend to which visual tokens. Specifically, we optimize a learnable visual token with respect to an energy function that strengthens the response of the referential region in the attention map. This enables detailed region description and reasoning without costly retraining. Our method supports referring with boxes, masks, scribbles, and points, and offers a promising direction for integrating referential abilities into MLLMs. Results demonstrate that our method exhibits controllability and interpretability.
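The optimization described above can be sketched as follows. This is a minimal toy illustration under assumed shapes and names (`V`, `T`, `mask`, `region_mass` are all hypothetical, and a plain dot-product attention stands in for a real MLLM's attention layers): a learnable offset `delta` on the visual tokens is updated by gradient descent on an energy defined as the negative attention mass that text prompt tokens place on the referred region.

```python
import numpy as np

rng = np.random.default_rng(0)
num_visual, num_text, dim = 16, 4, 32
V = rng.standard_normal((num_visual, dim))    # frozen visual tokens (MLP output)
T = rng.standard_normal((num_text, dim))      # frozen text prompt tokens
mask = np.zeros(num_visual)
mask[3:7] = 1.0                               # 1 = visual token inside the referred region

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def region_mass(delta):
    """Average attention mass the text tokens place on the referred region."""
    attn = softmax(T @ (V + delta).T / np.sqrt(dim))
    return (attn * mask).sum(axis=-1).mean()

delta = np.zeros_like(V)                      # learnable visual-token offset
lr = 0.5
initial = region_mass(delta)
for _ in range(200):
    attn = softmax(T @ (V + delta).T / np.sqrt(dim))
    s = (attn * mask).sum(axis=-1, keepdims=True)   # per-text-token region mass
    # Gradient of the energy E = -mean_i sum_j attn_ij * mask_j
    # through the softmax, then chained back to delta.
    dE_dlogits = -attn * (mask - s) / num_text
    grad = dE_dlogits.T @ T / np.sqrt(dim)
    delta -= lr * grad                        # descend the energy

print(region_mass(delta) > initial)           # attention on the region has grown
```

In a real MLLM the visual tokens, text tokens, and attention maps would come from the model itself, and the energy would be evaluated over selected attention heads and layers; the sketch only shows the shape of the energy-based update.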