Generalizing beyond the training domain in image-based behavior cloning remains challenging. Existing methods each address a single axis of generalization (workspace shifts, viewpoint changes, or cross-embodiment transfer), yet they are typically developed in isolation and often rely on complex pipelines. We introduce PALM (Perception Alignment for Local Manipulation), which leverages the invariance of local action distributions between out-of-distribution (OOD) and demonstrated domains to address these OOD shifts concurrently, without additional input modalities, model changes, or data collection. PALM modularizes the manipulation policy into coarse global components and a local policy for fine-grained actions. By enforcing local visual focus and a consistent proprioceptive representation at the local-policy level, we reduce the discrepancy between in-domain and OOD inputs, allowing the policy to retrieve invariant local actions under OOD conditions. Experiments show that PALM limits the OOD performance drop to 8% in simulation and 24% in the real world, compared with 45% and 77% for baselines.