Developing gaze estimation models that generalize well to unseen domains and in-the-wild conditions remains a challenge with no known best solution. This is mostly due to the difficulty of acquiring ground truth data that cover the distribution of faces, head poses, and environmental conditions that exist in the real world. In this work, we propose to train general gaze estimation models on 3D geometry-aware gaze pseudo-annotations, which we extract from arbitrary unlabelled face images that are abundantly available on the internet. Additionally, we leverage the observation that head, body, and hand pose estimation benefit from being reformulated as dense 3D coordinate prediction, and similarly express gaze estimation as regression of dense 3D eye meshes. We overcome the absence of compatible ground truth by fitting rigid 3D eyeballs on existing gaze datasets, and we design a multi-view supervision framework to balance the effect of pseudo-labels during training. We test our method on the task of gaze generalization, in which we demonstrate an improvement of up to $30\%$ compared to the state of the art when no ground truth data are available, and up to $10\%$ when they are. The project material will become available for research purposes.
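To make the dense-mesh formulation concrete, the following is a minimal sketch of how a gaze direction might be read off a 3D eye mesh and scored against a reference direction. It is not the authors' implementation: the function names `gaze_from_eyeball` and `angular_error_deg`, the vertex layout, and the choice of taking the eyeball and iris centres as plain vertex means are all assumptions made for illustration.

```python
import numpy as np


def gaze_from_eyeball(eyeball_vertices, iris_vertices):
    """Unit gaze vector pointing from the eyeball centre to the iris centre.

    Assumption: both centres are approximated by the mean of their vertices,
    which is only reasonable for roughly uniformly sampled meshes.
    """
    eyeball_center = eyeball_vertices.mean(axis=0)
    iris_center = iris_vertices.mean(axis=0)
    direction = iris_center - eyeball_center
    return direction / np.linalg.norm(direction)


def angular_error_deg(g_pred, g_true):
    """Angle in degrees between two 3D gaze vectors."""
    cos = np.dot(g_pred, g_true) / (np.linalg.norm(g_pred) * np.linalg.norm(g_true))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))


# Hypothetical toy eyeball: six vertices on a sphere of radius 12 centred at
# the origin, with a four-vertex iris ring facing the camera along -z.
eyeball = 12.0 * np.array([
    [1, 0, 0], [-1, 0, 0],
    [0, 1, 0], [0, -1, 0],
    [0, 0, 1], [0, 0, -1],
], dtype=float)
iris = np.array([
    [4, 0, -11], [-4, 0, -11],
    [0, 4, -11], [0, -4, -11],
], dtype=float)

gaze = gaze_from_eyeball(eyeball, iris)
error = angular_error_deg(gaze, np.array([0.0, 0.0, -1.0]))
```

In this toy setup the recovered vector points straight along the negative z-axis, so the angular error against that reference is zero. A real pipeline would operate on predicted meshes in camera coordinates and would likely need a more careful estimate of the eyeball centre.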