Synthesizing physics-grounded 3D assets is a critical bottleneck for interactive virtual worlds and embodied AI. Existing methods predominantly focus on static geometry, overlooking the functional properties essential for interaction. We propose that interactive asset generation must be rooted in functional logic and hierarchical physics. To bridge this gap, we introduce PhysForge, a decoupled two-stage framework supported by PhysDB, a large-scale dataset of 150,000 assets with four-tier physical annotations. First, a VLM acts as a "physical architect" to plan a "Hierarchical Physical Blueprint" defining material, functional, and kinematic constraints. Second, a physics-grounded diffusion model realizes this blueprint by synthesizing high-fidelity geometry alongside precise kinematic parameters via a novel KineVoxel Injection (KVI) mechanism. Experiments demonstrate that PhysForge produces functionally plausible, simulation-ready assets, providing a robust data engine for interactive 3D content and embodied agents.
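As a rough illustration of what a hierarchical blueprint carrying material, functional, and kinematic constraints might look like as a data structure, consider the minimal Python sketch below. It is not PhysForge's actual schema: the class names (`Joint`, `PartNode`), all field names, the helper `movable_parts`, and the cabinet example values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Joint:
    """Hypothetical kinematic constraint linking a part to its parent."""
    kind: str                             # e.g. "revolute" or "prismatic"
    axis: Tuple[float, float, float]      # joint axis direction
    origin: Tuple[float, float, float]    # pivot point in the parent frame
    limits: Tuple[float, float]           # lower and upper motion range


@dataclass
class PartNode:
    """One blueprint node: a part with material, function, and motion."""
    name: str
    material: str
    function: str
    joint: Optional[Joint] = None         # None means rigidly attached
    children: List["PartNode"] = field(default_factory=list)


def movable_parts(node: PartNode) -> List[str]:
    """Collect names of parts that carry a joint, in depth-first order."""
    names = [node.name] if node.joint is not None else []
    for child in node.children:
        names.extend(movable_parts(child))
    return names


# Illustrative values only; not taken from PhysDB.
door = PartNode(
    "door", "wood", "open/close access",
    joint=Joint("revolute", (0.0, 0.0, 1.0), (0.4, 0.0, 0.0), (0.0, 1.57)),
)
drawer = PartNode(
    "drawer", "wood", "store items",
    joint=Joint("prismatic", (0.0, 1.0, 0.0), (0.0, 0.0, 0.3), (0.0, 0.35)),
)
cabinet = PartNode("cabinet_body", "wood", "support and enclose",
                   children=[door, drawer])

print(movable_parts(cabinet))
```

In a two-stage design of the kind described, a planner would emit a tree like this first, and a separate generator would then consume it to produce geometry and joint parameters; keeping the blueprint as plain data is what makes that decoupling possible.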