Text-to-3D synthesis has recently seen intriguing advances by combining text-to-image priors with 3D representation methods, e.g., 3D Gaussian Splatting (3D GS), via Score Distillation Sampling (SDS). However, a major hurdle of existing methods is their low efficiency: each 3D object requires its own per-prompt optimization. A paradigm shift from per-prompt optimization to feed-forward generation for any unseen text prompt is therefore imperative, yet it remains challenging. One obstacle is how to directly generate a set of millions of 3D Gaussians to represent a 3D object. This paper presents BrightDreamer, an end-to-end feed-forward approach that achieves generalizable and fast (77 ms) text-to-3D generation. Our key idea is to formulate the generation process as estimating the 3D deformation from an anchor shape with predefined positions. To this end, we first propose a Text-guided Shape Deformation (TSD) network to predict the deformed shape and its new positions, which are used as the centers (one attribute) of the 3D Gaussians. To estimate the other four attributes (i.e., scaling, rotation, opacity, and SH), we then design a novel Text-guided Triplane Generator (TTG) to generate a triplane representation for a 3D object. Given the center of each Gaussian, we transform its spatial feature into the four attributes. The generated 3D Gaussians can finally be rendered at 705 frames per second. Extensive experiments demonstrate the superiority of our method over existing methods. BrightDreamer also shows a strong semantic understanding capability, even for complex text prompts. The code is available on the project page.
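The two-stage idea described above (deform anchor positions into Gaussian centers, then use each center to look up a triplane feature) can be illustrated with a minimal pure-Python sketch. It is not the BrightDreamer implementation: the function names, the regular-grid anchor shape, the bilinear lookup, and the summation of the three plane features are all assumptions made for illustration, and the text-conditioned networks that would predict the offsets and the planes, as well as the heads that decode scaling, rotation, opacity, and SH, are omitted.

```python
import math

def make_anchor_grid(n):
    """Return n**3 anchor positions on a regular grid in [-1, 1]^3 (requires n >= 2)."""
    ticks = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    return [(x, y, z) for x in ticks for y in ticks for z in ticks]

def deform(anchors, offsets):
    """Gaussian centers = anchor position + offset (offsets would come from a network)."""
    return [tuple(a + d for a, d in zip(anchor, offset))
            for anchor, offset in zip(anchors, offsets)]

def bilinear_sample(plane, u, v):
    """Bilinearly sample plane[row][col] (a list of channels) at (u, v) in [-1, 1]."""
    res = len(plane)
    # Map [-1, 1] to pixel coordinates [0, res - 1]; clamp points that fall outside.
    x = min(max((u + 1.0) * 0.5 * (res - 1), 0.0), res - 1.0)
    y = min(max((v + 1.0) * 0.5 * (res - 1), 0.0), res - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, res - 1), min(y0 + 1, res - 1)
    wx, wy = x - x0, y - y0
    channels = len(plane[0][0])
    return [
        (1 - wx) * (1 - wy) * plane[y0][x0][c]
        + wx * (1 - wy) * plane[y0][x1][c]
        + (1 - wx) * wy * plane[y1][x0][c]
        + wx * wy * plane[y1][x1][c]
        for c in range(channels)
    ]

def query_triplane(planes, center):
    """Sum the features sampled on the XY, XZ, and YZ planes (summation is an assumption)."""
    x, y, z = center
    f_xy = bilinear_sample(planes["xy"], x, y)
    f_xz = bilinear_sample(planes["xz"], x, z)
    f_yz = bilinear_sample(planes["yz"], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]

# Usage: 8 anchors at the cube corners, shifted by a constant stand-in offset.
anchors = make_anchor_grid(2)
centers = deform(anchors, [(0.5, 0.0, 0.0)] * len(anchors))
# Toy 3x3 single-channel plane whose feature equals the column index.
toy_plane = [[[float(col)] for col in range(3)] for _row in range(3)]
planes = {"xy": toy_plane, "xz": toy_plane, "yz": toy_plane}
features = [query_triplane(planes, c) for c in centers]
```

Because the per-center lookup is independent of all other centers, it parallelizes trivially, which is consistent with the feed-forward, single-pass generation the abstract describes.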