LucidDreaming: Controllable Object-Centric 3D Generation

With the recent development of generative models, text-to-3D generation has also seen significant growth, opening the door for the general public to create 3D assets for video games. Nonetheless, people without professional 3D editing experience find it hard to achieve precise control over 3D generation, especially when the prompt contains multiple objects, because controlling generation through text alone often leads to missing objects and imprecise locations. In this paper, we present LucidDreaming, an effective pipeline capable of spatial and numerical control over 3D generation from only textual prompt commands or 3D bounding boxes. Specifically, our research demonstrates that Large Language Models (LLMs) possess 3D spatial awareness and can effectively translate textual 3D information into precise 3D bounding boxes. As the initial step of our process, we leverage LLMs to obtain information about each individual object together with its 3D bounding box. Given the bounding boxes, we further propose clipped ray sampling and object-centric density blob bias to generate 3D objects that align with the bounding boxes. We show that our method exhibits remarkable adaptability across a spectrum of mainstream Score Distillation Sampling-based 3D generation frameworks, and that our pipeline can even be used to insert objects into an existing NeRF scene. Moreover, we provide a dataset of prompts with 3D bounding boxes for benchmarking 3D spatial controllability. With extensive qualitative and quantitative experiments, we demonstrate that LucidDreaming achieves superior results in object placement precision and generation fidelity compared to current approaches, while maintaining flexibility and ease of use for non-expert users.
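
The two box-conditioned components named above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give formulas, so the code assumes that clipped ray sampling amounts to restricting sample points to the segment of each ray inside an axis-aligned bounding box (found with the standard slab test), and that the density blob bias is a Gaussian bump centered on the box and scaled by its half-extents. All function names and the `strength` parameter are hypothetical.

```python
import math


def clip_ray_to_box(origin, direction, box_min, box_max, near=0.0, far=math.inf):
    """Slab-method ray/box intersection.

    Returns (t_enter, t_exit) for the part of the ray inside the
    axis-aligned box, or None if the ray misses it.
    """
    t_enter, t_exit = near, far
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if abs(d) < 1e-12:
            # Ray is parallel to this slab: it misses unless the origin is inside.
            if o < lo or o > hi:
                return None
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_enter = max(t_enter, t0)
        t_exit = min(t_exit, t1)
        if t_enter > t_exit:
            return None
    return t_enter, t_exit


def sample_clipped(origin, direction, box_min, box_max, n_samples=4):
    """Place n_samples points at bin midpoints of the clipped ray segment.

    Assumes a non-zero direction so the clipped segment is finite.
    """
    span = clip_ray_to_box(origin, direction, box_min, box_max)
    if span is None:
        return []  # the ray never enters the box, so nothing is sampled
    t_enter, t_exit = span
    step = (t_exit - t_enter) / n_samples
    ts = [t_enter + (i + 0.5) * step for i in range(n_samples)]
    return [tuple(o + t * d for o, d in zip(origin, direction)) for t in ts]


def blob_density_bias(point, box_min, box_max, strength=10.0):
    """Hypothetical Gaussian density bias that peaks at the box center.

    Distances are normalized by the box half-extents, so the box must
    have non-zero size along every axis.
    """
    r2 = 0.0
    for p, lo, hi in zip(point, box_min, box_max):
        center = 0.5 * (lo + hi)
        half = 0.5 * (hi - lo)
        r2 += ((p - center) / half) ** 2
    return strength * math.exp(-0.5 * r2)


if __name__ == "__main__":
    box_min, box_max = (-1.0, -1.0, -1.0), (1.0, 1.0, 1.0)
    for point in sample_clipped((0.0, 0.0, -3.0), (0.0, 0.0, 1.0), box_min, box_max):
        print(point, blob_density_bias(point, box_min, box_max))
```

In this sketch every sample falls inside the object's box, and the bias is largest at the box center and decays toward its faces. Under these assumptions, that is one way the generated density could be encouraged to form inside the box rather than elsewhere in the scene.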
