3D content generation has recently attracted significant research interest, driven by its critical applications in VR/AR and embodied AI. In this work, we tackle the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and the corresponding object masks as input and simultaneously produces multiple 3D assets with both geometry and texture. Notably, SceneGen requires no extra optimization or asset retrieval; (ii) we introduce a feature aggregation module that integrates local and global scene information from the visual and geometric encoders during feature extraction. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image inputs: despite being trained solely on single-image inputs, our architecture yields improved generation quality when multiple images are provided; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robustness of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: https://mengmouxu.github.io/SceneGen.
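To make contribution (ii) concrete, below is a minimal sketch of how such a feature aggregation module and position head could be wired together. All module names, dimensions, the choice of cross-attention, and the 4-DoF (translation plus scale) position parameterization are illustrative assumptions, not SceneGen's actual implementation.

```python
# Minimal sketch (assumptions throughout): per-asset query tokens attend to
# global scene tokens from a visual encoder and a geometric encoder, and a
# position head regresses each asset's relative placement in one forward pass.
import torch
import torch.nn as nn

class FeatureAggregator(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Cross-attention: asset queries gather global visual context.
        self.visual_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-attention: asset queries gather global geometric context.
        self.geom_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Hypothetical position head: relative translation (x, y, z) + scale.
        self.position_head = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 4)
        )

    def forward(self, asset_tokens, visual_tokens, geom_tokens):
        # asset_tokens:  (B, N_assets, D) local features from masked objects
        # visual_tokens: (B, N_vis, D)    global tokens from the visual encoder
        # geom_tokens:   (B, N_geo, D)    global tokens from the geometric encoder
        x = asset_tokens
        x = x + self.visual_attn(x, visual_tokens, visual_tokens)[0]
        x = x + self.geom_attn(x, geom_tokens, geom_tokens)[0]
        x = self.norm(x)
        # Aggregated features condition the asset generator; positions come
        # out of the same feedforward pass, one 4-vector per asset.
        return x, self.position_head(x)

# Toy usage with dummy tensors (shapes are assumptions):
agg = FeatureAggregator()
feats, pos = agg(torch.randn(2, 4, 512),    # 4 assets
                 torch.randn(2, 196, 512),  # visual scene tokens
                 torch.randn(2, 196, 512))  # geometric scene tokens
print(feats.shape, pos.shape)  # torch.Size([2, 4, 512]) torch.Size([2, 4, 4])
```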