This study targets a critical aspect of multi-modal LLM (LLM and VLM) inference: explicit, controllable text generation. Multi-modal LLMs pair multi-modality understanding with semantic generation capability, yet their autoregressive generative nature reduces explainability and increases reliance on prompt contents. While manipulating prompt formats can improve outputs, designing specific, precise prompts for each task is challenging and often ineffective. To tackle this issue, we introduce a novel inference method, Prompt Highlighter, which enables users to highlight specific prompt spans and interactively control the focus during generation. Motivated by classifier-free diffusion guidance, we form regular and unconditional context pairs based on highlighted tokens, demonstrating that autoregressive generation in these models can be guided in a classifier-free way. Notably, we find that, during inference, guiding the models toward highlighted tokens through the attention weights leads to more desirable outputs. Our approach is compatible with current LLMs and VLMs, achieving impressive customized generation results without training. Experiments confirm its effectiveness in focusing on input contexts and generating reliable content. Without tuning, our method on LLaVA-v1.5 secures 70.7 on the MMBench test and 1552.5 on MME-perception. The code is available at: https://github.com/dvlab-research/Prompt-Highlighter/
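The classifier-free guidance idea described above can be sketched at the logit level: the model is run once with the regular (highlight-aware) context and once with the unconditional context, and the next-token distribution is extrapolated away from the unconditional prediction. The following is a minimal sketch of that combination step only; the function name, `guidance_scale` parameter, and use of raw NumPy logits are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def guided_logits(cond_logits: np.ndarray,
                  uncond_logits: np.ndarray,
                  guidance_scale: float = 1.5) -> np.ndarray:
    """Classifier-free guidance on next-token logits (illustrative sketch).

    cond_logits:   logits from the regular, highlight-conditioned context
    uncond_logits: logits from the unconditional (highlight-masked) context
    guidance_scale: 1.0 recovers the conditional logits; larger values
                    push the distribution further toward the highlighted
                    context and away from the unconditional one.
    """
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

# Example: with scale 2.0 the conditional shift is doubled.
cond = np.array([2.0, 0.0, 1.0])
uncond = np.array([1.0, 0.0, 1.0])
print(guided_logits(cond, uncond, guidance_scale=2.0))  # [3. 0. 1.]
```

In practice the two forward passes would come from the same autoregressive model, with the unconditional branch obtained by down-weighting or masking the highlighted tokens, and decoding (e.g. sampling or greedy selection) then proceeds from the guided logits.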