Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, leaving implicit the 3D geometry required for contact-rich manipulation. We propose the Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained geometric foundation model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, and a causal future predictor inserted at the split layer forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks for feature propagation and decoding, allowing a single backbone to produce both future geometry and actions. This design equips the GFM with language-conditioned temporal world modeling through minimal architectural modification while preserving its rich geometric priors. Across a broad suite of simulation and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.
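The split-and-route design described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the toy residual blocks stand in for pretrained GFM blocks, and every name and size here (`SPLIT`, `causal_predictor`, `gam_forward`, the token width, the action dimension, the additive conditioning, and the linear action head) is a hypothetical choice made only to show the data flow from shallow encoder, through a causal predictor at the split layer, to the remaining blocks and two outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16          # token width (assumed)
N_BLOCKS = 6    # depth of the toy backbone (assumed)
SPLIT = 3       # index of the split layer (assumed)
ACTION_DIM = 7  # action dimensionality (assumed)


def make_block(d):
    """Toy residual block standing in for one pretrained GFM block."""
    w = rng.normal(scale=0.1, size=(d, d))
    return lambda x: x + np.tanh(x @ w)


blocks = [make_block(D) for _ in range(N_BLOCKS)]
shallow, deep = blocks[:SPLIT], blocks[SPLIT:]  # split the backbone in two


def run(stack, x):
    """Apply a list of blocks in order."""
    for block in stack:
        x = block(x)
    return x


def causal_predictor(latents, cond):
    """Toy causal predictor: step t only attends to steps <= t.

    latents: (T, P, D) per-frame tokens from the shallow layers.
    cond:    (T, D) per-step conditioning vector.
    """
    T = latents.shape[0]
    summary = latents.mean(axis=1) + cond                  # (T, D)
    scores = summary @ summary.T / np.sqrt(D)              # (T, T)
    mask = np.tril(np.ones((T, T), dtype=bool))            # lower-triangular
    scores = np.where(mask, scores, -np.inf)               # hide future steps
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    context = weights @ summary                            # (T, D)
    return latents + context[:, None, :]                   # (T, P, D)


def gam_forward(frames, language, proprio, past_actions, w_action):
    latents = run(shallow, frames)                         # observation encoder
    cond = language[None, :] + proprio + past_actions      # (T, D), assumed additive
    future = causal_predictor(latents, cond)               # predicted future tokens
    features = run(deep, future)                           # remaining backbone blocks
    geometry = features                                    # stand-in for a geometry head
    actions = features.mean(axis=1) @ w_action             # (T, ACTION_DIM)
    return geometry, actions


# Usage with random stand-in inputs (already embedded to width D).
T, P = 4, 5
frames = rng.normal(size=(T, P, D))
language = rng.normal(size=D)
proprio = rng.normal(size=(T, D))
past_actions = rng.normal(size=(T, D))
w_action = rng.normal(scale=0.1, size=(D, ACTION_DIM))

geometry, actions = gam_forward(frames, language, proprio, past_actions, w_action)
print(geometry.shape, actions.shape)
```

In this sketch the same list of blocks serves both as encoder (`shallow`) and as decoder (`deep`), which mirrors the idea of a single backbone producing both future geometry and actions; the lower-triangular mask is what makes the toy predictor causal.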