PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian Splatting

As text-conditioned diffusion models (DMs) achieve breakthroughs in image, video, and 3D generation, the research community's focus has shifted to the more challenging task of text-to-4D synthesis, which adds a temporal dimension to generate dynamic 3D objects. In this context, we identify Score Distillation Sampling (SDS), a widely used technique for text-to-3D synthesis, as a significant hindrance to text-to-4D performance: it suffers from the Janus-face (multi-face) problem and unrealistic textures, and it incurs high computational costs. In this paper, we propose Pixel-Level Alignments for Text-to-4D Gaussian Splatting (PLA4D), a novel method that uses text-to-video frames as explicit pixel alignment targets to generate static 3D objects and inject motion into them. Specifically, we introduce Focal Alignment to calibrate camera poses for rendering and GS-Mesh Contrastive Learning to distill geometry priors from rendered image contrasts at the pixel level. Additionally, we develop Motion Alignment, which uses a deformation network to drive changes in the Gaussians, and Reference Refinement to obtain smooth 4D object surfaces. Together, these techniques enable 4D Gaussian Splatting to align geometry, texture, and motion with the generated videos at the pixel level. Compared with previous methods, PLA4D produces outputs with better texture details in less time and effectively mitigates the Janus-face problem. PLA4D is implemented entirely with open-source models, offering an accessible, user-friendly, and promising direction for 4D digital content creation. Project page: https://github.com/MiaoQiaowei/PLA4D.github.io.
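The abstract does not specify PLA4D's loss functions or network architecture. As a minimal sketch, assuming an L1 photometric loss for the pixel-level alignment and a small MLP that maps a Gaussian center plus a normalized timestamp to a position offset, the two core ingredients (aligning renders to video frames, and time-conditioned deformation of Gaussians) might look like the following. The names `pixel_alignment_loss` and `ToyDeformationMLP` are hypothetical, only Gaussian centers are deformed here, and this is not the authors' implementation.

```python
import numpy as np


def pixel_alignment_loss(rendered: np.ndarray, target: np.ndarray) -> float:
    """Hypothetical pixel-level alignment objective: mean absolute (L1)
    difference between a rendered view and a video frame, both (H, W, C)."""
    if rendered.shape != target.shape:
        raise ValueError("rendered and target must share shape (H, W, C)")
    return float(np.mean(np.abs(rendered - target)))


class ToyDeformationMLP:
    """Hypothetical deformation field (assumed architecture, not PLA4D's):
    maps (x, y, z, t) to a deformed Gaussian center via a one-hidden-layer MLP."""

    def __init__(self, hidden: int = 32, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, size=(4, hidden))  # input: xyz + t
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, size=(hidden, 3))  # output: dx, dy, dz
        self.b2 = np.zeros(3)

    def __call__(self, centers: np.ndarray, t: float) -> np.ndarray:
        # Append the normalized timestamp t to every Gaussian center (N, 3) -> (N, 4).
        times = np.full((centers.shape[0], 1), t)
        features = np.concatenate([centers, times], axis=1)
        hidden = np.tanh(features @ self.w1 + self.b1)
        offsets = hidden @ self.w2 + self.b2
        # Deformed centers at time t are the canonical centers plus predicted offsets.
        return centers + offsets


if __name__ == "__main__":
    centers = np.random.default_rng(1).normal(size=(4, 3))
    field = ToyDeformationMLP(hidden=8)
    print(field(centers, t=0.5).shape)
    frame = np.zeros((2, 2, 3))
    print(pixel_alignment_loss(frame, frame))
```

In a full pipeline, the deformed Gaussians would be splatted to an image at timestamp t and compared with the corresponding generated video frame through the alignment loss, with gradients updating the deformation network; that differentiable rendering and optimization step is omitted from this sketch.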
