Vision-Language-Action (VLA) models drive next-generation autonomous systems, but training them requires scalable, high-quality annotations from complex environments. Current cloud pipelines rely on generic vision-language models (VLMs) that, because of their 2D image-text pretraining, lack geometric reasoning and domain semantics. To address this mismatch, we propose XEmbodied, a cloud-side foundation model that endows VLMs with intrinsic 3D geometric awareness and the ability to interact with physical cues (e.g., occupancy grids, 3D boxes). Instead of treating geometry as auxiliary input, XEmbodied integrates geometric representations via a structured 3D Adapter and distills physical signals into context tokens using an Efficient Image-Embodied Adapter. Through a progressive domain curriculum and reinforcement learning post-training, XEmbodied preserves general capabilities while demonstrating robust performance across 18 public benchmarks. It significantly improves spatial reasoning, traffic semantics, embodied affordance, and out-of-distribution generalization for large-scale scenario mining and embodied VQA.