Despite recent progress in learning-based gaze estimation, most methods still take one or more eye or face region crops as input and produce a gaze direction vector as output. Cropping increases the resolution of the eye regions and removes confounding factors (such as clothing and hair), both of which are believed to benefit final model performance. However, this eye/face patch cropping process is expensive, error-prone, and implemented differently across methods. In this paper, we propose a frame-to-gaze network that directly predicts both the 3D gaze origin and the 3D gaze direction from the raw camera frame, without any face or eye cropping. Our method demonstrates that direct gaze regression from the raw frame, downscaled from FHD/HD to VGA/HVGA resolution, is possible despite the challenge of having very few pixels in the eye region. The proposed method achieves results comparable to state-of-the-art methods in Point-of-Gaze (PoG) estimation on three public gaze datasets (GazeCapture, MPIIFaceGaze, and EVE) and generalizes well to extreme camera view changes.
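
To make the link between the two predicted quantities and the PoG metric concrete, the following is a minimal sketch of how a Point-of-Gaze could be obtained from a 3D gaze origin and a 3D gaze direction by intersecting the gaze ray with a screen plane. It is an illustration under stated assumptions, not the implementation used in this work: the function name `point_of_gaze`, the choice of millimetres, and the default screen plane (z = 0 in the camera coordinate system, with normal along +z) are all hypothetical.

```python
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two 3D vectors."""
    return sum(x * y for x, y in zip(a, b))


def point_of_gaze(
    origin: Sequence[float],
    direction: Sequence[float],
    plane_point: Sequence[float] = (0.0, 0.0, 0.0),   # assumed: screen passes through the camera origin
    plane_normal: Sequence[float] = (0.0, 0.0, 1.0),  # assumed: screen lies in the z = 0 plane
    eps: float = 1e-9,
) -> Optional[Vec3]:
    """Intersect the gaze ray origin + t * direction (t >= 0) with a plane.

    Returns the 3D intersection point, or None if the ray is parallel
    to the plane or the plane lies behind the gaze origin.
    """
    denom = _dot(direction, plane_normal)
    if abs(denom) < eps:
        return None  # gaze ray is (numerically) parallel to the screen plane

    diff = tuple(p - o for p, o in zip(plane_point, origin))
    t = _dot(diff, plane_normal) / denom
    if t < 0:
        return None  # intersection would be behind the gaze origin

    return tuple(o + t * d for o, d in zip(origin, direction))


# Example: gaze origin 500 mm in front of the screen, looking back towards it.
pog = point_of_gaze(origin=(0.0, 0.0, 500.0), direction=(0.1, -0.2, -1.0))
```

The direction vector does not need to be normalized here, since the ray parameter `t` absorbs its scale. Converting the resulting 3D point to screen pixels would additionally require the physical screen size and resolution, which are device-specific and omitted from this sketch.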