Recent robot foundation models largely rely on large-scale behavior cloning, which imitates expert actions but discards the transferable dynamics knowledge embedded in heterogeneous embodied data. While the Unified World Model (UWM) formulation has the potential to leverage such diverse data, existing instantiations struggle to scale to the foundation-model level due to coarse data usage and fragmented datasets. We introduce LDA-1B, a robot foundation model that scales through universal embodied data ingestion by jointly learning dynamics, policy, and visual forecasting, assigning distinct roles to data of varying quality. To support this regime at scale, we assemble and standardize EI-30k, an embodied interaction dataset comprising over 30k hours of human and robot trajectories in a unified format. Scalable dynamics learning over such heterogeneous data is enabled by prediction in a structured DINO latent space, which avoids redundant pixel-space appearance modeling. Complementing this representation, LDA-1B employs a multi-modal diffusion transformer to handle asynchronous vision and action streams, enabling stable training at the 1B-parameter scale. Experiments in simulation and the real world show that LDA-1B outperforms prior methods (e.g., $\pi_{0.5}$) by up to 21\%, 48\%, and 23\% on contact-rich, dexterous, and long-horizon tasks, respectively. Notably, LDA-1B enables data-efficient fine-tuning, gaining 10\% when 30\% of the fine-tuning trajectories are low-quality data that would typically be harmful and discarded.