Vision-language models (VLMs) demonstrate strong reasoning and planning abilities, yet grounding their predictions in precise robot actions remains a central challenge. Existing Vision-Language-Action methods typically entangle reasoning and action generation, which limits generalization. We propose the Generalizable Action Expert (GAE), a task-agnostic model that converts sparse geometric plans into dense robot actions. Our approach introduces a sparse geometric interface: the VLM predicts sparse 3D waypoints representing high-level intention, while GAE maps these waypoints, together with real-time point cloud observations, to continuous action trajectories. GAE is pretrained on a large-scale dataset of paired point clouds and trajectories, comprising 150k trajectories from both simulation and real-world robots. To further improve efficiency and generalization, we introduce an Action Pre-training, Pointcloud Fine-tuning (APPF) scheme that decouples learning action dynamics from geometry grounding. After pretraining, GAE is frozen and reused across downstream tasks, requiring only lightweight fine-tuning of the VLM to produce the sparse interface. Experiments show that our method achieves strong performance and generalizes across diverse visual domains, camera viewpoints, and natural language instructions.
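To make the shape of the sparse geometric interface concrete, the following is a minimal sketch of its inputs and outputs only. It is not the GAE model: the learned action expert is replaced here by plain linear interpolation between waypoints, and the function name `densify_waypoints`, the `steps` parameter, and the array layouts are all hypothetical choices made for illustration. The point cloud is accepted to mirror the interface but is not used by this stand-in.

```python
import numpy as np


def densify_waypoints(waypoints, point_cloud, steps=50):
    """Stand-in for the action expert (NOT the learned GAE model).

    waypoints:   (K, 3) sparse 3D waypoints, as a VLM might predict them.
    point_cloud: (N, 3) observed points; validated but unused by this stand-in.
    steps:       number of dense trajectory positions to produce.

    Returns a (steps, 3) array of positions obtained by linear interpolation
    between consecutive waypoints, parametrized by waypoint index.
    """
    waypoints = np.asarray(waypoints, dtype=float)
    point_cloud = np.asarray(point_cloud, dtype=float)
    if waypoints.ndim != 2 or waypoints.shape[1] != 3 or waypoints.shape[0] < 2:
        raise ValueError("waypoints must have shape (K, 3) with K >= 2")
    if point_cloud.ndim != 2 or point_cloud.shape[1] != 3:
        raise ValueError("point_cloud must have shape (N, 3)")
    if steps < 2:
        raise ValueError("steps must be at least 2")

    # Parametrize waypoints by their index 0..K-1 and sample that range densely.
    knots = np.arange(waypoints.shape[0], dtype=float)
    query = np.linspace(0.0, knots[-1], steps)

    # Interpolate each coordinate independently.
    return np.stack(
        [np.interp(query, knots, waypoints[:, d]) for d in range(3)], axis=1
    )


if __name__ == "__main__":
    plan = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
    cloud = np.zeros((100, 3))  # placeholder observation
    trajectory = densify_waypoints(plan, cloud, steps=5)
    print(trajectory.shape)
```

In the approach described above, the mapping from waypoints and point clouds to actions is learned and conditioned on the observed geometry. The sketch only illustrates the division of labor: a sparse plan goes in, a dense trajectory comes out.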