Robotic manipulation often requires memory: occlusion and state changes can leave decision-time observations perceptually aliased, so the same observation may arise from different interaction histories and action selection becomes non-Markovian at the observation level. Most embodied agents implement memory via semantically compressed traces and similarity-based retrieval, which discard disambiguating fine-grained perceptual cues and can return perceptually similar but decision-irrelevant episodes. Inspired by human episodic memory, we propose Chameleon, which writes geometry-grounded multimodal tokens to preserve disambiguating context and produces goal-directed recall through a differentiable memory stack. We also introduce Camo-Dataset, a real-robot UR5e dataset spanning episodic recall, spatial tracking, and sequential manipulation under perceptual aliasing. Across tasks, Chameleon consistently improves decision reliability and long-horizon control over strong baselines in perceptually confusable settings.