This study targets a critical aspect of inference in multi-modal LLMs (LLMs and VLMs): explicit, controllable text generation. Multi-modal LLMs add semantic generation to multi-modal understanding, but their autoregressive generative nature makes them less explainable and more reliant on prompt content. While manipulating prompt formats can improve outputs, designing specific and precise prompts for each task can be challenging and ineffective. To tackle this issue, we introduce a novel inference method, Prompt Highlighter, which lets users highlight specific prompt spans to interactively control the focus during generation. Motivated by classifier-free diffusion guidance, we form regular and unconditional context pairs based on the highlighted tokens, demonstrating that autoregressive generation can be guided in a classifier-free way. Notably, we find that, during inference, guiding the model with highlighted tokens through the attention weights leads to more desirable outputs. Our approach is compatible with current LLMs and VLMs and achieves impressive customized generation results without training. Experiments confirm its effectiveness in focusing on input contexts and generating reliable content. Applied to LLaVA-v1.5 without any tuning, our method scores 69.5 on the MMBench test and 1552.5 on MME-perception. The code is available at: https://github.com/dvlab-research/Prompt-Highlighter/
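The two guidance ideas named in the abstract can be illustrated with a minimal sketch. It is not the released implementation: the abstract gives no formulas, so the combination rule below is the standard classifier-free guidance form applied to next-token log-probabilities, and the attention step is an assumed additive bias on pre-softmax scores. The names `guided_log_probs`, `highlight_attention`, `scale`, and `boost` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def guided_log_probs(cond_logits, uncond_logits, scale):
    """Combine regular (highlighted) and unconditional next-token logits.

    Assumed rule (standard classifier-free guidance, not taken from the paper):
        mixed = uncond + scale * (cond - uncond)
    scale == 1 recovers the regular distribution; scale > 1 favours tokens
    that the highlighted context makes more likely.
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [u + scale * (c - u) for c, u in zip(cond, uncond)]
    return log_softmax(mixed)  # renormalise so the result is a distribution


def highlight_attention(scores, highlighted, boost):
    """Re-weight one attention row toward highlighted key positions.

    Assumption: adding log(boost) to a pre-softmax score multiplies that
    key's unnormalised attention weight by `boost`.
    """
    biased = [
        s + math.log(boost) if i in highlighted else s
        for i, s in enumerate(scores)
    ]
    return [math.exp(x) for x in log_softmax(biased)]


if __name__ == "__main__":
    cond = [2.0, 1.0, 0.0]    # logits with the highlighted span emphasised
    uncond = [1.0, 1.0, 1.0]  # logits with the highlighted span suppressed
    print([round(math.exp(x), 4) for x in guided_log_probs(cond, uncond, 2.0)])
    print(highlight_attention([0.0, 0.0, 0.0], {1}, 2.0))
```

In a real model the two logit vectors would come from two forward passes (or one batched pass) over the regular and unconditional contexts, and the attention bias would be applied inside each attention layer; both details are outside what the abstract specifies.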