We present VoxScene, an anchor-conditioned voxel diffusion framework for 3D scene synthesis. Current data-driven layout generation techniques typically rely on bounding proxies or implicit representations, which ignore the volumetric structure of objects. This geometric blindness leads to severe physical collisions and structural entanglement, particularly in densely populated environments. To overcome these limitations, we adopt an explicit, object-centric voxel representation. Our pipeline sequentially synthesizes discrete volumetric occupancies conditioned on prior anchors and local context. By exploiting the mutual exclusivity of discrete voxels, our approach eliminates spatial ambiguity and guarantees collision-free arrangements, even in highly complex environments. Furthermore, the synthesized high-fidelity voxel grids serve as discriminative geometric queries for downstream asset retrieval. Extensive experiments demonstrate the generality of our method, which achieves state-of-the-art physical plausibility while enabling greater shape diversity than existing layout planners.
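To make the mutual-exclusivity argument concrete, the following is a minimal sketch of a discrete occupancy grid in which each voxel has at most one owner. It is not the VoxScene diffusion model: the names `SceneGrid`, `box_voxels`, and `translate`, the box-shaped objects, and the grid size are all hypothetical, chosen only for illustration. The sketch shows why an explicit voxel representation makes overlap directly checkable: a placement is accepted only if none of its voxels is already owned.

```python
from itertools import product


def box_voxels(size):
    """Return the voxel offsets of a solid box with the given (x, y, z) size."""
    return set(product(*(range(s) for s in size)))


def translate(shape, anchor):
    """Shift a set of voxel offsets so that its origin sits at `anchor`."""
    ax, ay, az = anchor
    return {(x + ax, y + ay, z + az) for x, y, z in shape}


class SceneGrid:
    """Hypothetical scene grid in which every voxel has at most one owner."""

    def __init__(self, dims):
        self.dims = dims
        self.owner = {}  # maps voxel -> object name

    def fits(self, voxels):
        """True if all voxels are inside the grid and none is already owned."""
        in_bounds = all(
            0 <= c < d for v in voxels for c, d in zip(v, self.dims)
        )
        return in_bounds and not any(v in self.owner for v in voxels)

    def place(self, name, shape, anchor):
        """Commit an object at `anchor` only if it collides with nothing."""
        voxels = translate(shape, anchor)
        if not self.fits(voxels):
            return False
        for v in voxels:
            self.owner[v] = name
        return True


scene = SceneGrid((8, 8, 4))
placed_bed = scene.place("bed", box_voxels((4, 3, 1)), (0, 0, 0))
# This anchor lies inside the bed's footprint, so the placement is rejected.
rejected = scene.place("nightstand", box_voxels((1, 1, 1)), (3, 2, 0))
# Moving the anchor just past the bed yields a valid, collision-free placement.
accepted = scene.place("nightstand", box_voxels((1, 1, 1)), (4, 0, 0))
```

Because ownership is stored per voxel, two objects cannot share a cell by construction, which is the property the abstract relies on. A bounding-box proxy would have to approximate this test, and for non-convex shapes it would either miss collisions or reject valid placements.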