We present SceneNAT, a single-stage masked non-autoregressive Transformer that synthesizes complete 3D indoor scenes from natural language instructions in only a few parallel decoding passes. SceneNAT is trained with masked modeling over fully discretized representations of both semantic and spatial attributes; applying the masking strategy at both the attribute level and the instance level helps the model capture intra-object as well as inter-object structure. To strengthen relational reasoning, SceneNAT employs a dedicated triplet predictor that maps a set of learnable relation queries to a sparse set of symbolic (subject, predicate, object) triplets describing the scene layout and object relationships. Extensive experiments on the 3D-FRONT dataset show that SceneNAT outperforms state-of-the-art autoregressive and diffusion baselines in both semantic compliance and spatial-arrangement accuracy, while operating at substantially lower computational cost.
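The few-pass parallel decoding the abstract describes follows the general iterative masked (MaskGIT-style) scheme: start from a fully masked token sequence, predict all positions in parallel, keep the most confident predictions, and re-mask the rest for the next pass. The sketch below illustrates only that generic loop with a random stand-in predictor; it is not SceneNAT's actual model or interface.

```python
# Toy sketch of confidence-based parallel masked decoding (MaskGIT-style),
# the general technique behind few-pass non-autoregressive synthesis.
# `toy_predictor` is a random stand-in for the conditional Transformer.
import numpy as np

MASK = -1  # sentinel id for a masked attribute token

def toy_predictor(tokens, rng, vocab=8):
    """Stand-in for the model: a token proposal and a confidence score
    per position (the real model would condition on the instruction)."""
    logits = rng.random((len(tokens), vocab))
    probs = logits / logits.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs.max(axis=1)

def parallel_decode(seq_len=12, passes=4, seed=0):
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK)
    for t in range(passes):
        preds, conf = toy_predictor(tokens, rng)
        masked = tokens == MASK
        # Unmask a growing fraction of the most confident positions;
        # everything else stays masked for the next parallel pass.
        n_keep = int(np.ceil(masked.sum() * (t + 1) / passes))
        order = np.argsort(-np.where(masked, conf, -np.inf))
        tokens[order[:n_keep]] = preds[order[:n_keep]]
    return tokens

out = parallel_decode()
assert (out != MASK).all()  # fully decoded after a fixed number of passes
```

Note that the total number of forward passes is fixed in advance, independent of sequence length, which is the source of the efficiency gain over token-by-token autoregressive decoding.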
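The triplet predictor can likewise be sketched at a high level: a fixed set of learnable relation queries attends over per-object features, and each pooled query is classified into a (subject, predicate, object) triplet. All names and shapes below are illustrative assumptions, not SceneNAT's actual architecture.

```python
# Hypothetical sketch of a query-based triplet predictor: R learnable
# relation queries cross-attend over N per-object features, then three
# linear heads classify each query into subject slot, predicate class,
# and object slot. Weights are random here purely for illustration.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def predict_triplets(obj_feats, queries, W_subj, W_pred, W_obj):
    """obj_feats: (N, d) per-object features; queries: (R, d)."""
    attn = softmax(queries @ obj_feats.T)   # (R, N) cross-attention weights
    ctx = attn @ obj_feats                  # (R, d) pooled scene context
    subj = (ctx @ W_subj).argmax(axis=1)    # subject object-slot head
    pred = (ctx @ W_pred).argmax(axis=1)    # predicate class head
    obj = (ctx @ W_obj).argmax(axis=1)      # object object-slot head
    return list(zip(subj, pred, obj))       # sparse symbolic triplets

rng = np.random.default_rng(0)
N, d, R, P = 5, 16, 3, 6   # objects, feature dim, relation queries, predicates
triplets = predict_triplets(
    rng.standard_normal((N, d)),
    rng.standard_normal((R, d)),
    rng.standard_normal((d, N)),  # subject head maps to object slots
    rng.standard_normal((d, P)),  # predicate head maps to relation vocabulary
    rng.standard_normal((d, N)),  # object head maps to object slots
)
assert len(triplets) == R
```

Emitting a small, fixed budget of symbolic triplets keeps the relational supervision sparse, which matches the abstract's description of mapping queries to a sparse set of (subject, predicate, object) relations.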