WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

Ao Liang, Lingdong Kong, Tianyi Yan, Hongsi Liu, Wesley Yang, Ziqi Huang, Wei Yin, Jialong Zuo, Yixuan Hu, Dekai Zhu, Dongyue Lu, Youquan Liu, Guangfeng Jiang, Linfeng Li, Xiangtai Li, Long Zhuo, Lai Xing Ng, Benoit R. Cottereau, Changxin Gao, Liang Pan, Wei Tsang Ooi, Ziwei Liu

arXiv; CVPR 2026 Oral Presentation; 80 pages, 37 figures, 29 tables. Project page at https://worldbench.github.io/worldlens and GitHub at https://github.com/worldbench/WorldLens

Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects -- Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference -- jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity -- standardizing how future models are judged not only by how real they look, but by how real they behave.
