Gaze estimation, the task of predicting where an individual is looking, has direct applications in areas such as human-computer interaction and virtual reality. Estimating gaze direction in unconstrained environments is difficult because many factors can obscure the face and eye regions. In this work, we propose CrossGaze, a strong baseline for gaze estimation that leverages recent developments in computer vision architectures and attention-based modules. Unlike previous approaches, our method does not require a specialised architecture; instead, it integrates already established models and adapts them to the task of 3D gaze estimation. This design allows for seamless updates to the architecture, as any module can be replaced with a more powerful feature extractor. On the Gaze360 benchmark, our model surpasses several state-of-the-art methods, achieving a mean angular error of 9.94 degrees. Our proposed model serves as a strong foundation for future research and development in gaze estimation, paving the way for practical and accurate gaze prediction in real-world scenarios.
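The abstract reports its result as a mean angular error but does not define the metric. As a minimal sketch, assuming the common definition (the angle between the predicted and ground-truth 3D gaze vectors, averaged over samples), it could be computed as follows. The function names `angular_error_deg` and `mean_angular_error_deg` are hypothetical and are not taken from the CrossGaze code.

```python
import math


def angular_error_deg(pred, target):
    """Angle in degrees between two 3D gaze direction vectors.

    Assumption: both inputs are non-zero (x, y, z) vectors; they do not
    need to be unit length because the cosine is normalised here.
    """
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    # Clamp to [-1, 1] so floating-point drift cannot push acos out of range.
    cos_sim = max(-1.0, min(1.0, dot / (norm_p * norm_t)))
    return math.degrees(math.acos(cos_sim))


def mean_angular_error_deg(preds, targets):
    """Average the per-sample angular error over paired vector lists."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)
```

Under this definition, a perfect prediction scores 0 degrees and a prediction orthogonal to the ground truth scores 90 degrees, so lower values are better.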