Text-to-3D-aware face (T3D Face) generation and manipulation is an emerging research hotspot in machine learning, but existing methods still suffer from low efficiency and poor quality. In this paper, we propose an End-to-End Efficient and Effective network for fast and accurate T3D face generation and manipulation, termed $E^3$-FaceNet. Different from existing complex generation paradigms, $E^3$-FaceNet resorts to a direct mapping from text instructions to the 3D-aware visual space. We introduce a novel Style Code Enhancer to strengthen cross-modal semantic alignment, alongside an innovative Geometric Regularization objective to maintain consistency across multi-view generations. Extensive experiments on three benchmark datasets demonstrate that $E^3$-FaceNet not only achieves photo-realistic 3D face generation and manipulation, but also improves inference speed by orders of magnitude. For instance, compared with Latent3D, $E^3$-FaceNet speeds up five-view generation by almost 470 times, while still exceeding it in generation quality. Our code is released at https://github.com/Aria-Zhangjl/E3-FaceNet.