Generalization in robotic manipulation remains a critical challenge, particularly when scaling to new environments with limited demonstrations. This paper introduces CAGE, a novel robotic manipulation policy designed to overcome these generalization barriers by integrating a causal attention mechanism. CAGE utilizes the powerful feature extraction capabilities of the vision foundation model DINOv2, combined with LoRA fine-tuning, for robust environment understanding. The policy further employs a causal Perceiver for effective token compression and a diffusion-based action prediction head with attention mechanisms to enhance task-specific, fine-grained conditioning. With as few as 50 demonstrations from a single training environment, CAGE achieves robust generalization across diverse visual changes in objects, backgrounds, and viewpoints. Extensive experiments validate that CAGE significantly outperforms existing state-of-the-art RGB/RGB-D approaches on various manipulation tasks, especially under large distribution shifts. In similar environments, CAGE yields an average 42% increase in task completion rate. While all baselines fail to execute the task in unseen environments, CAGE obtains a 43% completion rate and a 51% success rate on average, marking a substantial step toward the practical deployment of robots in real-world settings. Project website: cage-policy.github.io.
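The pipeline sketched in the abstract — foundation-model visual tokens, Perceiver-style compression, then diffusion-based action prediction — can be illustrated with a minimal NumPy toy. This is not the authors' implementation: all shapes, weights, and the linear denoising stub below are hypothetical stand-ins chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_compress(tokens, latents, w_q, w_k, w_v):
    """Single cross-attention layer: K learned latent queries attend over
    N visual tokens, compressing the sequence to K latents."""
    q = latents @ w_q                                # (K, d)
    k = tokens @ w_k                                 # (N, d)
    v = tokens @ w_v                                 # (N, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (K, N) attention weights
    return attn @ v                                  # (K, d) compressed latents

# Toy sizes (hypothetical): 256 patch tokens of width 32 -> 8 latents.
N, K, d = 256, 8, 32
tokens = rng.normal(size=(N, d))    # stand-in for DINOv2 patch features
latents = rng.normal(size=(K, d))   # learned latent queries
w_q, w_k, w_v = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
cond = perceiver_compress(tokens, latents, w_q, w_k, w_v)

# Toy iterative denoising of an action chunk, conditioned on `cond`.
# A real diffusion head uses a learned noise-prediction network with a
# proper noise schedule; a fixed linear map stands in for it here.
horizon, act_dim = 16, 7
actions = rng.normal(size=(horizon, act_dim))        # start from pure noise
w_cond = 0.01 * rng.normal(size=(K * d, act_dim))
for _ in range(10):
    target = np.tanh(cond.reshape(-1) @ w_cond)      # conditioned estimate (stub)
    actions = actions - 0.1 * (actions - target)     # step toward the estimate
```

The key design point the sketch mirrors is that the policy never feeds all visual tokens to the action head directly: the latent bottleneck (`K` much smaller than `N`) keeps conditioning cheap while cross-attention decides what survives compression.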