With the rapid development of deep learning over the past decade, appearance-based gaze estimation has attracted great attention from both the computer vision and human-computer interaction research communities. Many methods have been proposed, built on various mechanisms including soft attention, hard attention, two-eye asymmetry, feature disentanglement, rotation consistency, and contrastive learning. Most of these methods take a single face or multiple regions as input, yet the basic architecture for gaze estimation has not been fully explored. In this paper, we show that tuning a few simple parameters of a ResNet architecture can outperform most existing state-of-the-art methods for gaze estimation on three popular datasets. From extensive experiments, we conclude that the stride, the input image resolution, and the multi-region architecture are critical for gaze estimation performance, while their effectiveness depends on the quality of the input face image. Using ResNet-50 as the backbone, we obtain state-of-the-art gaze estimation errors on three datasets: 3.64° on ETH-XGaze, 4.50° on MPIIFaceGaze, and 9.13° on Gaze360.
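To make the interaction between stride and input resolution concrete, the following minimal sketch computes the side length of the final feature map of a ResNet-style backbone. It assumes the usual ResNet layout (7x7 stem convolution, 3x3 max-pool, four stages) and is not the authors' implementation; the function names `conv_out` and `resnet_feature_size` and the alternative stride settings are hypothetical and chosen only for illustration.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet_feature_size(input_size, stage_strides=(1, 2, 2, 2),
                        stem_stride=2, use_maxpool=True):
    """Side length of the final feature map for a ResNet-style backbone.

    Assumes the usual layout: a 7x7 stem convolution (padding 3), an
    optional 3x3 max-pool with stride 2 (padding 1), then four stages,
    each downsampling once through a 3x3 convolution (padding 1) with
    the given stride.
    """
    size = conv_out(input_size, kernel=7, stride=stem_stride, padding=3)
    if use_maxpool:
        size = conv_out(size, kernel=3, stride=2, padding=1)
    for stride in stage_strides:
        size = conv_out(size, kernel=3, stride=stride, padding=1)
    return size


if __name__ == "__main__":
    # Hypothetical settings, not the configurations reported in the paper.
    for res in (224, 448):
        for strides in ((1, 2, 2, 2), (1, 2, 2, 1), (1, 2, 1, 1)):
            side = resnet_feature_size(res, strides)
            print(f"input {res}x{res}, stage strides {strides}: "
                  f"final feature map {side}x{side}")
```

Under these assumptions, doubling the input resolution and setting the last stage's stride to 1 both double the side length of the final feature map, which is one way to see why the two parameters are tuned together.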