Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models

World model-based policy evaluation is a practical proxy for testing real-world robot control: candidate actions are rolled out in action-conditioned video diffusion models. As these models increasingly adopt latent diffusion modeling (LDM), choosing the right latent space becomes critical. While the status quo uses autoencoding latent spaces such as VAEs, which are trained primarily for pixel reconstruction, recent work suggests benefits from pretrained encoders with representation-aligned semantic latent spaces. We systematically evaluate these latent spaces for action-conditioned LDM by training world model variants on six reconstruction and semantic encoders under a fixed protocol on the BridgeV2 dataset, and show that world models can be trained effectively in high-dimensional representation spaces both with and without dimension compression. We then propose three axes for assessing robotic world model performance: visual fidelity, planning and downstream policy performance, and latent representation quality. Our results show that visual fidelity alone is insufficient for world model selection. While reconstruction encoders such as VAE and Cosmos achieve strong pixel-level scores, semantic encoders such as V-JEPA 2.1 (strongest overall on policy), Web-DINO, and SigLIP 2 generally excel on the other two axes at all model scales. Our study advocates semantic latent spaces as a stronger foundation for policy-relevant robotics diffusion world models.

