Despite decades of research on data collection and model architectures, current gaze estimation models face significant challenges in generalizing across diverse data domains. Recent advances in self-supervised pre-training have shown remarkable generalization performance across various vision tasks, yet their effectiveness in gaze estimation remains unexplored. We propose UniGaze, which, for the first time, leverages large-scale in-the-wild facial datasets for gaze estimation through self-supervised pre-training. Through systematic investigation, we identify critical factors essential for effective pre-training in gaze estimation. Our experiments reveal that self-supervised approaches designed for semantic tasks fail when applied to gaze estimation, while our carefully designed pre-training pipeline consistently improves cross-domain performance. Through comprehensive experiments on challenging cross-dataset evaluation and novel protocols, including leave-one-dataset-out and joint-dataset settings, we demonstrate that UniGaze significantly improves generalization across multiple data domains while minimizing reliance on costly labeled data. Source code and models are available at https://github.com/ut-vision/UniGaze.