While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to strengthen sustained, on-demand access to visual evidence. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for enhanced visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM shows improved robustness in longer generations and accelerates internal prediction convergence.
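The dilution effect described above can be illustrated with a toy calculation. The sketch below is not the paper's analysis: it assumes every visual key shares one attention logit and every text key shares another, so the softmax denominator (the partition function) reduces to two terms. The function name `visual_attention_mass` and its parameters are hypothetical, introduced only for illustration.

```python
import math


def visual_attention_mass(num_visual, num_text, visual_logit=0.0, text_logit=0.0):
    """Fraction of softmax attention assigned to visual keys in a toy setting.

    Assumption (not from the paper): all visual keys share `visual_logit`
    and all text keys share `text_logit`.
    """
    visual_term = num_visual * math.exp(visual_logit)
    text_term = num_text * math.exp(text_logit)
    # The denominator is the attention partition function; it grows with
    # every generated text token while the visual term stays fixed.
    return visual_term / (visual_term + text_term)


if __name__ == "__main__":
    # The number of visual tokens is fixed; the text history keeps growing.
    for num_text in (64, 256, 1024, 4096):
        mass = visual_attention_mass(num_visual=256, num_text=num_text)
        print(f"text tokens={num_text:5d}  visual attention mass={mass:.3f}")
```

Under these assumptions the visual share equals V / (V + T) for V visual and T text tokens, so once T is much larger than V it falls off roughly as 1/T, which matches the inverse decay the abstract describes. Real attention logits vary across heads, layers, and tokens, so this shows only the direction of the effect.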