Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments; yet these environments often look convincing while failing physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, and support reliable control. We introduce WorldLens, a full-spectrum benchmark that evaluates how well a model builds, understands, and behaves within its generated world. It spans five aspects (Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference) that jointly cover visual realism, geometric consistency, physical plausibility, and functional reliability. Across these aspects, no existing world model excels universally: models with strong textures often violate physics, while geometrically stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity, standardizing how future models are judged: not only by how real they look, but by how real they behave.