Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model

Recent advances in diffusion models such as ControlNet have enabled geometrically controllable, high-fidelity text-to-image generation. However, none of these models addresses how to add such controllability to text-to-3D generation. In response, we propose Text2Control3D, a text-to-3D avatar generation method in which the avatar's facial expression is controlled by a monocular video casually captured with a hand-held camera. Our main strategy is to construct the 3D avatar in Neural Radiance Fields (NeRF), optimized with a set of controlled, viewpoint-aware images that we generate from ControlNet, whose condition input is the depth map extracted from the input video. When generating the viewpoint-aware images, we use cross-reference attention, which injects a well-controlled reference facial expression and appearance through cross attention. We also apply low-pass filtering to the Gaussian latent of the diffusion model to ameliorate the viewpoint-agnostic texture problem that we observed in our empirical analysis, in which the viewpoint-aware images contain identical textures at identical pixel positions that are incomprehensible in 3D. Finally, to train NeRF with images that are viewpoint-aware yet not strictly consistent in geometry, our approach treats each image's geometric variation as a deformed view of a shared 3D canonical space. Consequently, we construct the 3D avatar in the canonical space of a deformable NeRF by learning a set of per-image deformations via a deformation field table. We present empirical results and discuss the effectiveness of our method.
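To make the low-pass filtering step more concrete, the following is a minimal sketch of one plausible reading: the initial Gaussian latent is transformed with a 2D FFT, frequencies above a cutoff radius are zeroed, and the result is transformed back. The abstract does not specify the filter, so the function name `low_pass_latent`, the hard radial mask, the `cutoff` value, and the latent shape are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def low_pass_latent(latent: np.ndarray, cutoff: float = 0.25) -> np.ndarray:
    """Zero out high spatial frequencies of a (C, H, W) latent.

    `cutoff` is a hypothetical parameter: the radius, in cycles per sample,
    below which frequencies are kept. The filter used in the paper may differ.
    """
    _, h, w = latent.shape
    # Frequency coordinates in cycles per sample, each in [-0.5, 0.5).
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    # Hard radial mask: 1 inside the cutoff radius, 0 outside.
    mask = (np.sqrt(fx**2 + fy**2) <= cutoff).astype(latent.dtype)
    # Filter each channel independently over its two spatial axes.
    spectrum = np.fft.fft2(latent, axes=(-2, -1))
    return np.fft.ifft2(spectrum * mask, axes=(-2, -1)).real


# Example: filter a random latent of an assumed shape (4, 64, 64).
rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 64, 64))
smooth_noise = low_pass_latent(noise, cutoff=0.1)
```

Removing high frequencies also lowers the variance of the latent, so a real pipeline would likely need to decide whether to rescale it before sampling; the sketch leaves that choice open.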
