We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.
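To make the flow-matching decoding step concrete, the following is a minimal sketch and not the GE-Act implementation. It assumes a straight-line probability path and Euler integration. A closed-form toy velocity field stands in for the learned network, and the linear map from latent to target trajectory is purely hypothetical. The names `toy_velocity`, `decode_actions`, `weights`, `horizon`, and `action_dim` are illustrative and do not appear in the source.

```python
import numpy as np


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For the straight-line path x_t = (1 - t) * x_0 + t * x_1, the velocity
    that carries x_t to x_1 is (x_1 - x_t) / (1 - t). A trained model would
    predict this from (x, t, conditioning) without access to the target.
    """
    return (target - x) / (1.0 - t)


def decode_actions(latent, weights, horizon, action_dim, num_steps=10, seed=0):
    """Integrate the velocity field from noise (t=0) toward actions (t=1)."""
    rng = np.random.default_rng(seed)
    # Assumed conditioning: a linear map from the latent to a target trajectory.
    target = (latent @ weights).reshape(horizon, action_dim)
    # Start from Gaussian noise with the shape of an action trajectory.
    x = rng.standard_normal((horizon, action_dim))
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        x = x + dt * toy_velocity(x, t, target)  # forward Euler step
    return x, target


# Usage with made-up sizes: a 16-dim latent, 8 time steps, 7-dim actions.
rng = np.random.default_rng(1)
latent = rng.standard_normal(16)
weights = rng.standard_normal((16, 8 * 7))
actions, target = decode_actions(latent, weights, horizon=8, action_dim=7)
print(actions.shape)  # (8, 7)
```

With this toy field, the final Euler step lands on the target trajectory up to floating-point error, which makes the sketch easy to check. A learned velocity network would only approximate that behavior, and the number of integration steps would trade accuracy against inference cost.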