World model-based policy evaluation is a practical proxy for real-world robot testing: candidate actions are rolled out in action-conditioned video diffusion models. As these models increasingly adopt latent diffusion modeling (LDM), choosing the right latent space becomes critical. While the status quo uses autoencoding latent spaces such as VAEs, which are trained primarily for pixel reconstruction, recent work suggests benefits from pretrained encoders with representation-aligned semantic latent spaces. We systematically evaluate these latent spaces for action-conditioned LDM by training world model variants on six reconstruction and semantic encoders under a fixed protocol on the BridgeV2 dataset, and we show that world models can be trained effectively in high-dimensional representation spaces both with and without dimension compression. We then propose three axes for assessing robotic world model performance: visual fidelity, planning and downstream policy performance, and latent representation quality. Our results show that visual fidelity alone is insufficient for world model selection. While reconstruction encoders such as VAE and Cosmos achieve strong pixel-level scores, semantic encoders such as V-JEPA 2.1 (strongest overall on policy), Web-DINO, and SigLIP 2 generally excel on the other two axes across all model scales. Our study advocates semantic latent spaces as a stronger foundation for policy-relevant robotics diffusion world models.