In this work, we propose a training-free method that injects visual referring into Multimodal Large Language Models (MLLMs) through learnable visual token optimization. We observe that the attention layers of MLLMs model the relationship between text prompt tokens and visual tokens. Our approach adjusts the visual tokens from the MLP output during inference, thereby controlling which text prompt tokens attend to which visual tokens. Specifically, we optimize a learnable visual token with respect to an energy function that strengthens the attention on the referred regions in the attention map. This enables detailed region description and reasoning without model retraining or substantial training cost, offering a promising direction for integrating referential abilities into MLLMs. Our method supports referring with boxes, masks, scribbles, and points, and the results demonstrate that it exhibits both controllability and interpretability.
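To make the inference-time procedure concrete, the following is a minimal sketch, assuming a PyTorch-style setup, of how such an optimization could look: a learnable offset on the visual tokens is updated to minimize an energy that rewards attention mass inside the referred region. The names `attn_fn`, `region_mask`, `num_steps`, and `lr` are illustrative placeholders, not the paper's actual interface.

```python
# Illustrative sketch (not the authors' code) of optimizing a learnable offset on the
# visual tokens so that a chosen attention map concentrates on a referred region.
import torch

def optimize_visual_tokens(visual_tokens, text_tokens, attn_fn, region_mask,
                           num_steps=20, lr=1e-2):
    """
    visual_tokens: (N_v, d) visual tokens taken from the MLP output, kept frozen.
    text_tokens:   (N_t, d) prompt-token hidden states, kept frozen.
    attn_fn:       callable returning a (N_t, N_v) attention map for the given tokens
                   (hypothetical stand-in for an MLLM attention layer).
    region_mask:   (N_v,) binary mask marking visual tokens inside the referred region,
                   rasterized from a box, mask, scribble, or point.
    """
    # Learnable offset added to the visual tokens; only this tensor is optimized.
    delta = torch.zeros_like(visual_tokens, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)

    for _ in range(num_steps):
        attn = attn_fn(text_tokens, visual_tokens + delta)  # (N_t, N_v)
        # Energy: negative fraction of attention mass falling inside the referred
        # region, so minimizing it strengthens attention on that region.
        energy = -(attn * region_mask).sum() / attn.sum()
        optimizer.zero_grad()
        energy.backward()
        optimizer.step()

    return (visual_tokens + delta).detach()
```

In practice, `attn_fn` would be replaced by the attention computed in a chosen MLLM layer between the referring text tokens and the visual tokens, and `region_mask` would be obtained by rasterizing the user-provided box, mask, scribble, or point onto the visual token grid; because only the offset is optimized and the model weights stay frozen, the procedure remains training-free.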