Large vision-language models (LVLMs), designed to interpret and respond to human instructions, occasionally generate hallucinated or harmful content when given inappropriate instructions. This study uses linear probing to shed light on the hidden knowledge at the output layer of LVLMs. We demonstrate that the logit distribution of the first token contains sufficient information to determine whether to respond to an instruction, including recognizing unanswerable visual questions, defending against multi-modal jailbreaking attacks, and identifying deceptive questions. Such hidden knowledge is gradually lost in the logits of subsequent tokens during response generation. We then illustrate a simple decoding strategy applied at the generation of the first token, which effectively improves the generated content. Our experiments yield several insights. First, the CLIP model already contains a strong signal for solving these tasks, indicating potential bias in the existing datasets. Second, we observe performance improvements from utilizing the first-token logit distributions on three additional tasks: indicating uncertainty in math solving, mitigating hallucination, and image classification. Last, with the same training data, simply finetuning LVLMs improves model performance but is still inferior to linear probing on these tasks.
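To make the idea of linear probing on first-token logits concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each instruction has already been reduced to one feature vector (the LVLM's logits over the vocabulary at the first generated position) and a binary label (for example, answerable vs. unanswerable). Because no model or dataset is specified here, the logits are replaced by synthetic data; `make_synthetic_logits`, `train_linear_probe`, `probe_predict`, the vocabulary size, and all hyperparameters are hypothetical names and values chosen for illustration only.

```python
import numpy as np


def make_synthetic_logits(n, vocab_size, rng):
    """Hypothetical stand-in for first-token logits extracted from an LVLM.

    Assumption: the label shifts the logits of a handful of tokens, so a
    linear readout can recover it. Real logits would come from the model.
    """
    y = rng.integers(0, 2, size=n)
    direction = np.zeros(vocab_size)
    direction[:8] = 2.0  # hypothetical: 8 token logits carry the signal
    X = rng.normal(size=(n, vocab_size)) + np.outer(y, direction)
    return X, y


def train_linear_probe(X, y, lr=0.1, epochs=300, l2=1e-3):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        z = np.clip(X @ w + b, -30.0, 30.0)  # clip to keep exp() stable
        p = 1.0 / (1.0 + np.exp(-z))         # predicted P(label = 1)
        w -= lr * (X.T @ (p - y) / n + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b


def probe_predict(X, w, b):
    """Return hard 0/1 predictions from the linear probe."""
    return (X @ w + b > 0).astype(int)


rng = np.random.default_rng(0)
X_train, y_train = make_synthetic_logits(400, 64, rng)
X_test, y_test = make_synthetic_logits(200, 64, rng)

w, b = train_linear_probe(X_train, y_train)
test_acc = float(np.mean(probe_predict(X_test, w, b) == y_test))
print(f"probe accuracy on held-out synthetic logits: {test_acc:.3f}")
```

The probe is deliberately a single linear layer with the underlying model untouched: if such a readout separates the classes, the information was already present in the first-token logits. With a real LVLM, the synthetic generator would be replaced by logits collected at the first decoding step for each image and instruction pair.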