A General Framework to Boost 3D GS Initialization for Text-to-3D Generation by Lexical Richness

Text-to-3D content creation has recently received much attention, especially with the prevalence of 3D Gaussian Splatting (GS). In general, GS-based methods comprise two key stages: initialization and rendering optimization. For initialization, existing works either use random sphere initialization or apply 3D diffusion models, e.g., Point-E, to derive the initial shapes. However, such strategies suffer from two critical problems: 1) the final shapes remain similar to the initial ones even after training; 2) shapes can be produced only from simple texts, e.g., "a dog", and not from lexically richer texts, e.g., "a dog is sitting on the top of the airplane". To address these problems, this paper proposes a novel general framework that boosts 3D GS initialization for text-to-3D generation with respect to the lexical richness of the input text. Our key idea is to aggregate 3D Gaussians into spatially uniform voxels to represent complex shapes, while enabling spatial interaction among the 3D Gaussians and semantic interaction between the Gaussians and the text. Specifically, we first construct a voxelized representation in which each voxel holds a 3D Gaussian whose position, scale, and rotation are fixed, leaving opacity as the sole factor that determines whether a position is occupied. We then design an initialization network consisting mainly of two novel components: 1) a Global Information Perception (GIP) block and 2) a Gaussians-Text Fusion (GTF) block. This design enables each 3D Gaussian to assimilate spatial information from other areas and semantic information from the text. Extensive experiments on lexically simple, medium, and hard texts show that our framework produces higher-quality 3D GS initialization than existing methods, e.g., Shap-E. Our framework can also be seamlessly plugged into SoTA training frameworks, e.g., LucidDreamer, for semantically consistent text-to-3D generation.
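As an illustration of the voxelized representation described above, the following is a minimal sketch rather than the paper's implementation. The class name `VoxelGaussianGrid`, the grid resolution, the `[-extent, extent]` bounds, the half-voxel isotropic scale, the identity quaternion, and the 0.5 occupancy threshold are all assumptions made for the example. The only property taken from the abstract is that position, scale, and rotation are fixed per voxel and opacity alone decides occupancy.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class VoxelGaussianGrid:
    """Hypothetical container: one 3D Gaussian per voxel of a cubic grid."""
    resolution: int       # voxels per axis (assumed value, not from the paper)
    extent: float = 1.0   # grid spans [-extent, extent] on each axis (assumption)

    def __post_init__(self):
        r = self.resolution
        self.voxel_size = 2.0 * self.extent / r
        # Fixed attributes: positions sit at voxel centres and never change.
        self.positions = [
            tuple(-self.extent + (i + 0.5) * self.voxel_size for i in idx)
            for idx in product(range(r), repeat=3)
        ]
        # Fixed attributes shared by all Gaussians (assumed values).
        self.scale = (self.voxel_size / 2.0,) * 3
        self.rotation = (1.0, 0.0, 0.0, 0.0)  # identity quaternion (w, x, y, z)
        # The only free attribute: one opacity per voxel, initially empty.
        self.opacities = [0.0] * (r ** 3)

    def occupied_positions(self, threshold=0.5):
        """Return centres of voxels whose opacity exceeds the threshold."""
        return [p for p, a in zip(self.positions, self.opacities) if a > threshold]


# Usage: mark two opposite corner voxels of a 4x4x4 grid as occupied.
grid = VoxelGaussianGrid(resolution=4)
grid.opacities[0] = 0.9
grid.opacities[-1] = 0.8
print(grid.occupied_positions())
```

In this sketch, a network that initializes the shape would only need to predict the `opacities` list, since every other Gaussian attribute is determined by the grid layout.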
