Multimodal Large Language Models (MLLMs) have achieved remarkable performance by aligning pretrained visual representations with the linguistic knowledge embedded in Large Language Models (LLMs). However, existing approaches typically rely on final-layer visual features or learnable multi-layer fusion; lacking an explicit design for cross-layer interaction, they often fail to fully exploit hierarchical visual cues. In this work, we propose a Memory-Augmented Adapter (Mema) within the vision encoder. Specifically, Mema maintains a stateful memory that accumulates hierarchical visual representations across layers, with its evolution conditioned on both query embeddings and step-wise visual features. A portion of this memory is selectively injected into token representations via a feedback mechanism, thereby mitigating the attenuation of fine-grained visual cues from shallow layers. Designed as a lightweight, plug-and-play module, Mema integrates seamlessly into pretrained vision encoders without modifying the vanilla backbone architecture. Only a minimal set of additional parameters requires training, enabling adaptive visual feature refinement while reducing training overhead. Extensive experiments across multiple benchmarks demonstrate that Mema consistently improves performance, validating its effectiveness in complex multimodal reasoning tasks. The code has been released at https://github.com/Sisiliu312/Mema.
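To make the mechanism described above concrete, the following is a minimal NumPy sketch of a layer-wise memory that is updated from each layer's tokens and then fed back into them. It is an illustration under stated assumptions, not the released Mema implementation: the class `MemoryAdapterSketch`, the helper `make_frozen_layer`, the cross-attention read, the sigmoid gate, and the scalar feedback weight `alpha` are all hypothetical choices, since the abstract does not specify the update or injection rules.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def make_frozen_layer(dim, seed):
    # Stand-in for one frozen backbone layer (hypothetical, not a real ViT block).
    w = np.random.default_rng(seed).normal(scale=0.05, size=(dim, dim))
    return lambda x: x + np.tanh(x @ w)


class MemoryAdapterSketch:
    """Hypothetical memory adapter; all design choices here are assumptions."""

    def __init__(self, num_slots, dim, seed=0):
        rng = np.random.default_rng(seed)
        # These would be the only trainable parameters in this sketch.
        self.queries = rng.normal(scale=0.02, size=(num_slots, dim))
        self.w_gate = rng.normal(scale=0.02, size=(2 * dim, dim))
        self.alpha = 0.1  # assumed strength of the feedback injection
        self.dim = dim

    def init_memory(self):
        return np.zeros_like(self.queries)

    def update(self, memory, tokens):
        # Memory slots, offset by query embeddings, attend over current tokens.
        scores = (memory + self.queries) @ tokens.T / np.sqrt(self.dim)
        read = softmax(scores) @ tokens  # (num_slots, dim)
        # Gated blend of the old memory and the newly read features.
        gate_in = np.concatenate([memory, read], axis=-1) @ self.w_gate
        gate = 1.0 / (1.0 + np.exp(-gate_in))
        return (1.0 - gate) * memory + gate * read

    def feedback(self, memory, tokens):
        # Each token attends over the memory; the result is added residually.
        scores = tokens @ memory.T / np.sqrt(self.dim)
        return tokens + self.alpha * (softmax(scores) @ memory)


def run_encoder(tokens, frozen_layers, adapter):
    # The backbone layers are left untouched; the adapter acts between them.
    memory = adapter.init_memory()
    for layer in frozen_layers:
        tokens = layer(tokens)
        memory = adapter.update(memory, tokens)
        tokens = adapter.feedback(memory, tokens)
    return tokens, memory
```

In this sketch, setting `alpha` to zero makes `run_encoder` return exactly the output of the frozen layers alone, which is one simple way to picture the plug-and-play property: the backbone is never modified, and the adapter only adds a residual term on top of it.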