Recent breakthroughs in text-guided image generation have led to remarkable progress in 3D synthesis from text. By optimizing neural radiance fields (NeRFs) directly from text, recent methods produce impressive results. Yet because they represent the scene as a whole, these methods offer limited control over each object's placement or appearance. This is a major issue in scenarios that require refining or manipulating individual objects in the scene. To remedy this deficit, we propose a novel Global-Local training framework for synthesizing a 3D scene using object proxies. A proxy represents the object's placement in the generated scene and optionally defines its coarse geometry. The key to our approach is to represent each object as an independent NeRF. We alternate between optimizing each NeRF on its own and as part of the full scene. Thus, a complete representation of each object can be learned while also creating a harmonious scene with matching style and lighting. We show that using proxies allows a wide variety of editing options, such as adjusting the placement of each independent object, removing objects from a scene, or refining an object. Our results show that Set-the-Scene offers a powerful solution for scene synthesis and manipulation, filling a crucial gap in controllable text-to-3D synthesis.
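The alternation between per-object and full-scene optimization, together with proxy-based editing, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Proxy`, `Scene`, and `global_local_schedule` are hypothetical, the strict one-to-one alternation and round-robin object order are assumptions, and the actual NeRF rendering and text-guided loss are omitted. The sketch only shows which object models would be updated at each step.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Proxy:
    """Hypothetical object proxy: a placement plus optional coarse geometry."""
    name: str
    prompt: str
    position: Vec3
    coarse_shape: Optional[str] = None  # assumed: label or path of a rough shape


@dataclass
class Scene:
    """A scene as a collection of proxies, one independent object model each."""
    proxies: Dict[str, Proxy] = field(default_factory=dict)

    def add(self, proxy: Proxy) -> None:
        self.proxies[proxy.name] = proxy

    def move(self, name: str, position: Vec3) -> None:
        # Editing placement only touches the proxy, not the learned object.
        self.proxies[name].position = position

    def remove(self, name: str) -> None:
        # Removing an object leaves the remaining objects untouched.
        del self.proxies[name]


def global_local_schedule(
    scene: Scene, num_steps: int
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (mode, object names to optimize) for each training step.

    Assumed schedule: even steps are "local" (one object on its own, chosen
    round-robin), odd steps are "global" (all objects composed in the scene).
    """
    names = list(scene.proxies)
    if not names:
        return
    local_index = 0
    for step in range(num_steps):
        if step % 2 == 0:
            yield ("local", [names[local_index % len(names)]])
            local_index += 1
        else:
            yield ("global", names)


if __name__ == "__main__":
    scene = Scene()
    scene.add(Proxy("sofa", "a leather sofa", (0.0, 0.0, 0.0)))
    scene.add(Proxy("table", "a wooden table", (1.5, 0.0, 0.0), "box"))
    for mode, targets in global_local_schedule(scene, 4):
        print(mode, targets)
```

In a real training loop, a "local" step would render and supervise the selected object alone, while a "global" step would render the composed scene so that gradients reach every object; both are left out here because the abstract does not specify them.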