Foundation models rely on in-context learning for personalized decision making. The limited size of the context window necessitates memory compression and retrieval systems such as RAG. However, these systems often treat memory as large offline storage, which is ill-suited to embodied agents that must operate online under strict memory and compute constraints. In this work, we propose MemCtrl, a novel framework that uses Multimodal Large Language Models (MLLMs) to prune memory online. MemCtrl augments MLLMs with a trainable memory head μ that acts as a gate, determining which observations or reflections to retain, update, or discard during exploration. We evaluate two ways of training μ: 1) via an offline expert, and 2) via online RL, and observe significant improvements in overall embodied task completion for μ-augmented MLLMs. In particular, when augmenting two low-performing MLLMs with MemCtrl on multiple subsets of the EmbodiedBench benchmark, μ-augmented MLLMs improve by around 16% on average, and by over 20% on specific instruction subsets. Finally, we present a qualitative analysis of the memory fragments collected by μ, noting the superior performance of μ-augmented MLLMs on long and complex instruction types.
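To make the gating role of μ concrete, the following is a minimal sketch of a memory head that scores a candidate memory fragment against the current context and emits a three-way decision. The embedding dimension, the linear scorer, and all names here are illustrative assumptions; the abstract specifies only that μ is a trainable gate over retain/update/discard.

```python
# Illustrative sketch only: a linear gate deciding the fate of one memory
# fragment. MemCtrl's actual μ architecture is not specified in the abstract.
import numpy as np

ACTIONS = ("retain", "update", "discard")
rng = np.random.default_rng(0)

class MemoryHead:
    """Scores a fragment embedding against a context embedding and picks
    one of the three gating actions (weights here are random stand-ins
    for a trained gate)."""
    def __init__(self, d_model: int = 8):
        # One row of weights per action, over the concatenated input.
        self.W = rng.standard_normal((len(ACTIONS), 2 * d_model))

    def decide(self, fragment: np.ndarray, context: np.ndarray) -> str:
        x = np.concatenate([fragment, context])
        logits = self.W @ x          # one score per action
        return ACTIONS[int(np.argmax(logits))]

mu = MemoryHead()
frag = rng.standard_normal(8)   # hypothetical observation embedding
ctx = rng.standard_normal(8)    # hypothetical task-context embedding
print(mu.decide(frag, ctx))     # one of "retain", "update", "discard"
```

In an online loop, an agent would call such a gate on every new observation or reflection before it enters the bounded memory store, so memory size stays fixed during exploration.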