LucidDreaming: Controllable Object-Centric 3D Generation

With the recent development of generative models, text-to-3D generation has also advanced significantly. Nonetheless, precise control over 3D generation remains difficult, because text-only control often leads to missing objects and imprecise locations. Contemporary strategies for enhancing controllability in 3D generation often introduce additional parameters, such as customized diffusion models, which makes it hard to adapt them to different diffusion models or to create distinct objects. In this paper, we present LucidDreaming, an effective pipeline capable of fine-grained control over 3D generation. It requires only a minimal input of 3D bounding boxes, which can be deduced from a simple text prompt using a Large Language Model. Specifically, we propose clipped ray sampling to separately render and optimize objects according to user specifications. We also introduce an object-centric density blob bias, which fosters the separation of generated objects. By rendering and optimizing objects individually, our method excels not only at controlled content generation from scratch but also at generation within pre-trained NeRF scenes. In such scenarios, existing generative approaches often disrupt the integrity of the original scene, and current editing methods struggle to synthesize new content in empty spaces. We show that our method adapts well across a spectrum of mainstream Score Distillation Sampling-based 3D generation frameworks and achieves superior alignment of 3D content compared to baseline approaches. We also provide a dataset of prompts with 3D bounding boxes for benchmarking 3D spatial controllability.
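The abstract names two mechanisms without giving their formulas, so the following is only an illustrative sketch of how they might look, not the authors' implementation. It assumes that clipped ray sampling can be approximated by restricting ray samples to the segment inside an axis-aligned 3D bounding box (computed here with the standard slab test), and that the object-centric density blob bias can be approximated by a Gaussian bump centered on each box. All function names, the `strength` parameter, and the Gaussian form are hypothetical.

```python
import math

def clip_ray_to_box(origin, direction, box_min, box_max, near=0.0, far=math.inf):
    """Slab test: return (t_enter, t_exit) for the ray segment inside the box,
    or None if the ray misses it. `direction` is assumed to be non-zero."""
    t_enter, t_exit = near, far
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if abs(d) < 1e-12:
            # Ray is parallel to this pair of slabs: it must start between them.
            if o < lo or o > hi:
                return None
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_enter, t_exit = max(t_enter, t0), min(t_exit, t1)
        if t_enter > t_exit:
            return None
    return t_enter, t_exit

def sample_points_in_box(origin, direction, box_min, box_max, n_samples=8):
    """Place samples at the midpoints of n equal bins along the clipped segment,
    so only the object inside the box is rendered along this ray."""
    hit = clip_ray_to_box(origin, direction, box_min, box_max)
    if hit is None:
        return []
    t_enter, t_exit = hit
    step = (t_exit - t_enter) / n_samples
    ts = [t_enter + (i + 0.5) * step for i in range(n_samples)]
    return [tuple(o + t * d for o, d in zip(origin, direction)) for t in ts]

def blob_density_bias(point, box_min, box_max, strength=5.0):
    """Hypothetical Gaussian density bias centered on the box (assumed to have
    non-zero extent), with per-axis width equal to the box half-extent."""
    sq = 0.0
    for p, lo, hi in zip(point, box_min, box_max):
        center = 0.5 * (lo + hi)
        half = 0.5 * (hi - lo)
        sq += ((p - center) / half) ** 2
    return strength * math.exp(-0.5 * sq)
```

In a sketch like this, each object would be rendered from its own clipped samples and the bias would be added to the predicted density before volume rendering, encouraging each object to form inside its own box.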
