Multi-modal large language models (MLLMs) depend on in-context learning (ICL) for rapid task adaptation, but their scalability is limited by finite context windows and by key-value (KV) caches whose cost grows with the length of multi-modal sequences. Existing memory compression approaches typically rely on rigid token removal or sample-dependent importance estimation. These strategies introduce bias, disrupt semantic structure (particularly for visual representations), and yield static memories that cannot adapt to new queries. We introduce TASM (Task-Aware Structured Memory), a training-free framework that addresses these limitations by constructing memory that is task-aware, structure-preserving, and dynamically accessible. First, TASM employs task-vector guided compression, which replaces sample-specific signals with a task-level direction that captures relevance shared across demonstrations. Second, to preserve the underlying manifold of the representations, it applies semantics-aware token merging via bipartite graph matching, aggregating tokens rather than pruning them destructively. Finally, TASM organizes memory into a hierarchy comprising a compact Core Memory and a Latent Bank, which enables query-adaptive dynamic retrieval. Evaluations show that TASM maintains high performance under heavy compression, balancing efficiency with adaptability.
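To make the three stages concrete, the following is a minimal toy sketch, not the authors' implementation. It works on plain Python lists standing in for token representations, whereas a real system would operate on hidden states or KV-cache entries. Several choices here are assumptions that the abstract does not specify: the task direction is taken as the mean of per-demonstration mean tokens, the bipartite matching follows a ToMe-style split into even and odd positions, and the Core Memory is selected by cosine similarity to the task direction. All function names (`task_direction`, `bipartite_merge`, `build_memory`, `retrieve`) are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu > 0 and nv > 0 else 0.0


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def task_direction(demos):
    # Assumption: a task-level direction is approximated by averaging
    # the mean token of each demonstration.
    return mean([mean(tokens) for tokens in demos])


def bipartite_merge(tokens, r):
    # Assumption: ToMe-style bipartite matching. Tokens at even positions
    # (set A) are matched to their most similar token at an odd position
    # (set B); the r most similar pairs are merged by averaging.
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    if not b_idx or r <= 0:
        return [list(t) for t in tokens]
    edges = []
    for i in a_idx:
        j = max(b_idx, key=lambda k: cosine(tokens[i], tokens[k]))
        edges.append((cosine(tokens[i], tokens[j]), i, j))
    edges.sort(reverse=True)
    absorbed, groups = set(), {}
    for _, i, j in edges[:r]:
        absorbed.add(i)
        groups.setdefault(j, []).append(i)
    merged = []
    for idx, tok in enumerate(tokens):
        if idx in absorbed:
            continue  # this token now lives inside its B partner
        members = [tok] + [tokens[i] for i in groups.get(idx, [])]
        merged.append(mean(members))
    return merged


def build_memory(demos, r, core_size):
    # Merge within each demonstration, then rank all merged tokens by
    # alignment with the task direction: top entries form the Core Memory,
    # the remainder form the Latent Bank.
    direction = task_direction(demos)
    merged = [tok for tokens in demos for tok in bipartite_merge(tokens, r)]
    ranked = sorted(merged, key=lambda t: cosine(t, direction), reverse=True)
    return ranked[:core_size], ranked[core_size:]


def retrieve(query, core, bank, m):
    # Query-adaptive step: always keep the core, and add the m bank
    # entries most similar to the current query.
    extra = sorted(bank, key=lambda t: cosine(t, query), reverse=True)[:m]
    return core + extra


# Toy usage: two demonstrations with four 2-D "tokens" each.
demos = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]],
    [[1.0, 0.1], [0.0, 1.0], [0.8, 0.0], [0.2, 1.0]],
]
core, bank = build_memory(demos, r=1, core_size=2)
context = retrieve([0.0, 1.0], core, bank, m=1)
```

In this sketch, merging reduces each demonstration by at most `r` tokens without discarding any token outright, and the retrieval step lets the effective memory change per query while the core stays fixed. The scoring functions and the partition scheme are placeholders that would need to be replaced with whatever the full method defines.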