View transformers process multi-view observations to predict actions and have shown impressive performance in robotic manipulation. Existing methods typically extract static visual representations in a view-specific manner, which leads to inadequate 3D spatial reasoning and a lack of dynamic adaptation. To address these challenges, we take inspiration from how the human brain integrates static and dynamic views and propose Cortical Policy, a novel dual-stream view transformer for robotic manipulation that jointly reasons over a static-view stream and a dynamic-view stream. The static-view stream enhances spatial understanding by aligning features of geometrically consistent keypoints extracted from a pretrained 3D foundation model. The dynamic-view stream achieves adaptive adjustment through position-aware pretraining of an egocentric gaze estimation model, computationally replicating the human cortical dorsal pathway. The complementary view representations of the two streams are then integrated to determine the final actions, enabling the model to handle language-conditioned tasks that are spatially complex and dynamically changing. Empirical evaluations on RLBench, the challenging COLOSSEUM benchmark, and real-world tasks demonstrate that Cortical Policy substantially outperforms state-of-the-art baselines, validating the superiority of the dual-stream design for visuomotor control. Our cortex-inspired framework offers a fresh perspective on robotic manipulation and holds potential for broader application in vision-based robot control.