End-to-end robotic manipulation models have recently gained significant attention for their generalizability and scalability. However, when trained with a fixed camera, they are often not robust to changes in camera viewpoint. In this paper, we propose VistaBot, a novel framework that integrates feed-forward geometric models with video diffusion models to achieve view-robust closed-loop manipulation without requiring camera calibration at test time. Our approach consists of three key components: 4D geometry estimation, view synthesis latent extraction, and latent action learning. We integrate VistaBot into both an action-chunking policy (ACT) and a diffusion-based policy ($\pi_0$) and evaluate it on simulation and real-world tasks. We further introduce the View Generalization Score (VGS), a new metric for comprehensive evaluation of cross-view generalization. Results show that VistaBot improves VGS by $2.79\times$ and $2.63\times$ over ACT and $\pi_0$, respectively, while also achieving high-quality novel view synthesis. Our contributions include a geometry-aware synthesis model, a latent action planner, a new benchmark metric, and extensive validation across diverse environments. The code and models will be made publicly available.