Accurately reconstructing and tracking dynamic human faces from image sequences is challenging because non-rigid deformations, expression changes, and viewpoint variations occur simultaneously, creating significant ambiguity in geometry and correspondence estimation. We present a unified method for high-fidelity 4D facial reconstruction based on canonical facial point prediction, a representation that assigns each pixel a normalized facial coordinate in a shared canonical space. This formulation recasts dense tracking and dynamic reconstruction as a canonical reconstruction problem, enabling temporally consistent geometry and reliable correspondences within a single feed-forward model. By jointly predicting depth and canonical coordinates, our method provides accurate depth estimation, temporally stable reconstruction, dense 3D geometry, and robust facial point tracking within a single architecture. We implement this formulation with a transformer-based model trained on multi-view geometry data that is non-rigidly warped into the canonical space. Extensive experiments on image and video benchmarks demonstrate state-of-the-art performance across reconstruction and tracking tasks, with approximately 3$\times$ lower correspondence error and faster inference than prior dynamic reconstruction methods, and a 16% improvement in depth accuracy. These results highlight canonical facial point prediction as an effective foundation for unified feed-forward 4D facial reconstruction.
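To illustrate why per-pixel canonical coordinates reduce dense tracking to a lookup, the sketch below matches pixels across two frames by nearest neighbor in canonical space. It is a toy illustration under stated assumptions, not the paper's model or matching procedure: the canonical maps are assumed to be given as `(H, W, 3)` arrays (in the paper they would come from the transformer), and the function name `match_by_canonical`, the array layout, and the brute-force search are all hypothetical choices made for this example.

```python
import numpy as np

def match_by_canonical(canon_a, canon_b, mask_a, mask_b):
    """Match face pixels of frame A to frame B via nearest canonical coordinate.

    canon_*: (H, W, 3) float arrays of per-pixel canonical coordinates
             (assumed layout; not specified by the source).
    mask_*:  (H, W) bool arrays marking face pixels.
    Returns (pix_a, pix_b, dist): matched (row, col) pairs and the
    canonical-space distance of each match.
    """
    pix_a = np.argwhere(mask_a)   # (Na, 2), row-major order
    pix_b = np.argwhere(mask_b)   # (Nb, 2), row-major order
    ca = canon_a[mask_a]          # (Na, 3), same order as pix_a
    cb = canon_b[mask_b]          # (Nb, 3), same order as pix_b
    # Brute-force pairwise squared distances; only suitable for small inputs.
    d2 = ((ca[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
    nn = d2.argmin(axis=1)
    dist = np.sqrt(d2[np.arange(len(ca)), nn])
    return pix_a, pix_b[nn], dist

# Synthetic example: frame B shows the same canonical surface shifted
# two pixels to the right in image space.
H, W = 8, 8
rows, cols = np.mgrid[0:H, 0:W]
canon_a = np.stack([cols / W, rows / H, np.zeros((H, W))], axis=-1)
canon_b = np.roll(canon_a, shift=2, axis=1)
mask_a = cols < W - 2   # pixels of A that remain visible in B
mask_b = cols >= 2      # pixels of B not affected by the wrap-around

pix_a, pix_b, dist = match_by_canonical(canon_a, canon_b, mask_a, mask_b)
```

In this synthetic case every matched pixel in frame B lies two columns to the right of its source pixel in frame A, so the image-space motion is recovered from the canonical maps alone. A practical system would likely replace the quadratic search with a spatial index and reject matches whose canonical distance is large.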