Domain-Adaptive Full-Face Gaze Estimation via Novel-View-Synthesis and Feature Disentanglement

Along with the recent development of deep neural networks, appearance-based gaze estimation has succeeded considerably when training and testing within the same domain. Compared to the within-domain task, the variance of different domains makes the cross-domain performance drop severely, preventing gaze estimation deployment in real-world applications. Among all the factors, ranges of head pose and gaze are believed to play a significant role in the final performance of gaze estimation, while collecting large ranges of data is expensive. This work proposes an effective model training pipeline consisting of a training data synthesis and a gaze estimation model for unsupervised domain adaptation. The proposed data synthesis leverages the single-image 3D reconstruction to expand the range of the head poses from the source domain without requiring a 3D facial shape dataset. To bridge the inevitable gap between synthetic and real images, we further propose an unsupervised domain adaptation method suitable for synthetic full-face data. We propose a disentangling autoencoder network to separate gaze-related features and introduce background augmentation consistency loss to utilize the characteristics of the synthetic source domain. Through comprehensive experiments, we show that the model only using monocular-reconstructed synthetic training data can perform comparably to real data with a large label range. Our proposed domain adaptation approach further improves the performance on multiple target domains. The code and data will be available at \url{https://github.com/ut-vision/AdaptiveGaze}.

翻译：随着深度神经网络的近期发展，基于外观的视线估计在相同域内训练和测试时取得了显著成功。与域内任务相比，不同域之间的差异导致跨域性能严重下降，阻碍了视线估计在实际应用中的部署。在所有因素中，头部姿态和视线范围被认为对视线估计的最终性能起着重要作用，但收集大范围数据成本高昂。本文提出了一种有效的模型训练流程，包括训练数据合成和用于无监督域自适应的视线估计模型。所提出的数据合成方法利用单图像三维重建来扩展源域的头部姿态范围，而无需使用三维面部形状数据集。为了弥合合成图像与真实图像之间不可避免的差距，我们进一步提出了一种适用于合成全脸数据的无监督域自适应方法。我们提出了一种解耦自编码器网络来分离与视线相关的特征，并引入背景增强一致性损失以利用合成源域的特性。通过全面实验，我们证明仅使用单目重建的合成训练数据的模型，其性能可与具有大标签范围的真实数据相媲美。我们提出的域自适应方法进一步提升了在多个目标域上的性能。代码和数据将在 \url{https://github.com/ut-vision/AdaptiveGaze} 提供。