While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to ensure sustained, on-demand visual perception. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for precise visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM can resist length-induced signal decay and accelerate internal prediction convergence.
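The dilution argument can be illustrated with a toy calculation. The sketch below is not the paper's analysis; it assumes, purely for illustration, that every visual token shares one attention logit and every text token shares another, so the softmax partition function reduces to two terms. The function name `visual_attention_mass` and its parameters are hypothetical.

```python
import math


def visual_attention_mass(num_visual, num_text, visual_logit=0.0, text_logit=0.0):
    """Fraction of softmax attention that lands on visual tokens.

    Toy assumption (not from the paper): all visual tokens share
    `visual_logit` and all text tokens share `text_logit`, so the
    partition function is Z = N_v * exp(s_v) + T * exp(s_t).
    """
    visual_term = num_visual * math.exp(visual_logit)
    text_term = num_text * math.exp(text_logit)
    return visual_term / (visual_term + text_term)


if __name__ == "__main__":
    # Keep the image fixed at 256 tokens and let the text history grow.
    for num_text in (64, 256, 1024, 4096, 16384):
        mass = visual_attention_mass(num_visual=256, num_text=num_text)
        print(f"text tokens = {num_text:6d}  visual attention mass = {mass:.4f}")
```

Under these assumptions the visual share is N_v e^{s_v} / (N_v e^{s_v} + T e^{s_t}). Once the text term dominates, this behaves like a constant divided by T, which is the inverse-length decay described above. Real attention logits vary per token and per head, so the sketch shows only the normalization effect, not the measured behavior of any model.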