How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning

The physical world is not merely visual; it is governed by rigorous structural and procedural constraints. Yet the evaluation of vision-language models (VLMs) remains heavily skewed toward perceptual realism, prioritizing the generation of visually plausible 3D layouts, shapes, and appearances. Current benchmarks rarely test whether models grasp the step-by-step processes and physical dependencies required to actually build these artifacts, a capability essential for automating design-to-construction pipelines. To address this, we introduce DreamHouse, a novel benchmark for physical generative reasoning: the capacity to synthesize artifacts that simultaneously satisfy geometric, structural, constructability, and code-compliance constraints. We ground this benchmark in residential timber-frame construction, a domain with fully codified engineering standards and objectively verifiable correctness. We curate over 26,000 structures spanning 13 architectural styles, each verified to construction-document standards (LOD 350), and develop a deterministic 10-test structural validation framework. Unlike static benchmarks that assess only final outputs, DreamHouse supports iterative agentic interaction: models observe intermediate build states, generate construction actions, and receive structured environmental feedback, enabling fine-grained evaluation of planning, structural reasoning, and self-correction. Extensive experiments with state-of-the-art VLMs reveal substantial capability gaps that are largely invisible on existing leaderboards. These findings establish physical validity as a critical evaluation axis orthogonal to visual realism, highlighting physical generative reasoning as a distinct and underdeveloped frontier in multimodal intelligence. The benchmark is available at https://luluyuyuyang.github.io/dreamhouse
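
To make the observe, act, and feedback cycle concrete, the following is a minimal toy sketch in Python. It is not the DreamHouse API: the names `ToyBuildEnv`, `Feedback`, `REQUIRES`, `feedback_driven_policy`, and `run_episode` are hypothetical, and the single "support must exist first" rule only stands in for the benchmark's 10-test validation framework, whose details the abstract does not give. The policy is a stand-in for a VLM that starts with a poor plan (roof first) and repairs it from structured feedback.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical dependency table: each member needs its support placed first.
REQUIRES = {"sill": None, "stud": "sill", "top_plate": "stud", "rafter": "top_plate"}


@dataclass
class Feedback:
    ok: bool                       # whether the action passed the toy check
    message: str                   # human-readable explanation
    missing: Optional[str] = None  # member that must be placed first, if any


class ToyBuildEnv:
    """Toy environment (an assumption, not the benchmark's interface)."""

    def __init__(self):
        self.placed = []

    def observe(self):
        # Intermediate build state: members placed so far, in order.
        return tuple(self.placed)

    def step(self, member):
        if member not in REQUIRES:
            return Feedback(False, f"unknown member: {member}")
        if member in self.placed:
            return Feedback(False, f"{member} is already placed")
        support = REQUIRES[member]
        if support is not None and support not in self.placed:
            return Feedback(False, f"{member} needs {support} first", missing=support)
        self.placed.append(member)
        return Feedback(True, f"placed {member}")

    def done(self):
        return len(self.placed) == len(REQUIRES)


def feedback_driven_policy(state, last):
    """Stand-in for a model: bad initial plan, corrected via feedback."""
    if last is not None and not last.ok and last.missing is not None:
        return last.missing  # self-correction: place the missing support
    for member in reversed(list(REQUIRES)):  # deliberately roof-first
        if member not in state:
            return member
    return None


def run_episode(env, policy, max_steps=20):
    """Run the observe/act/feedback loop; return final state and rejections."""
    last, rejected = None, 0
    for _ in range(max_steps):
        if env.done():
            break
        action = policy(env.observe(), last)
        if action is None:
            break
        last = env.step(action)
        rejected += 0 if last.ok else 1
    return list(env.placed), rejected
```

Calling `run_episode(ToyBuildEnv(), feedback_driven_policy)` returns the final placement order together with the number of rejected actions. Counting rejections alongside the final state illustrates why an interactive setup can separate planning quality from eventual success, which a check of the final output alone cannot do.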
