Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.
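The split the abstract describes, where the predictive modules supply supervision during training and are skipped at inference, can be illustrated with a minimal sketch. Everything here is hypothetical and is not GeoPredict's implementation: the class `ToyPolicy`, its three placeholder heads, the loss weights `w_kp` and `w_depth`, and the use of mean squared error are all assumptions chosen only to show the control flow.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    assert len(pred) == len(target)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


@dataclass
class ToyPolicy:
    """Hypothetical stand-in for a policy with auxiliary predictive heads."""

    aux_calls: int = 0  # counts how often an auxiliary head is invoked

    def action_head(self, features: Vector) -> Vector:
        # Placeholder: a real policy would decode continuous actions here.
        return [0.5 * f for f in features]

    def keypoint_head(self, features: Vector) -> Vector:
        # Placeholder for multi-step 3D keypoint trajectory prediction.
        self.aux_calls += 1
        return [f + 0.1 for f in features]

    def depth_head(self, features: Vector) -> Vector:
        # Placeholder for depth rendered from predicted 3D geometry.
        self.aux_calls += 1
        return [2.0 * f for f in features]

    def training_loss(
        self,
        features: Vector,
        action_gt: Vector,
        keypoints_gt: Vector,
        depth_gt: Vector,
        w_kp: float = 0.1,     # assumed weight, not from the paper
        w_depth: float = 0.1,  # assumed weight, not from the paper
    ) -> float:
        # Training: the action loss plus both auxiliary supervision terms.
        loss = mse(self.action_head(features), action_gt)
        loss += w_kp * mse(self.keypoint_head(features), keypoints_gt)
        loss += w_depth * mse(self.depth_head(features), depth_gt)
        return loss

    def act(self, features: Vector) -> Vector:
        # Inference: only the action head runs; no auxiliary decoding.
        return self.action_head(features)


if __name__ == "__main__":
    policy = ToyPolicy()
    feats = [1.0, 2.0]
    print(policy.training_loss(feats, [0.4, 1.2], [1.0, 2.0], [2.5, 3.5]))
    print(policy.act(feats), policy.aux_calls)
```

In this toy version, `aux_calls` increases only inside `training_loss`, so calling `act` adds no auxiliary computation. The sketch does not model the extra query tokens that the abstract says remain at inference time.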