We propose a novel neural pipeline, MSGazeNet, that learns gaze representations by exploiting eye anatomy information through a multistream framework. Our solution comprises two components: a network that isolates anatomical eye regions, and a network for multistream gaze estimation. Eye region isolation is performed with a U-Net style network, which we train on a synthetic dataset containing eye region masks for the visible eyeball and the iris. This synthetic dataset is generated with the UnityEyes simulator and consists of 80,000 eye images. After training, the eye region isolation network is transferred to the real domain to generate masks for real-world eye images. To make this transfer succeed, we apply domain randomization during training: augmentations that resemble artifacts give the synthetic images a larger variance. The generated eye region masks and the raw eye images are then used together as a multistream input to our gaze estimation network, whose encoders consist of wide residual blocks. The output embeddings of these encoders are fused along the channel dimension before being fed into the gaze regression layers. We evaluate our framework on three gaze estimation datasets and achieve strong performance. Our method surpasses the state of the art by 7.57% and 1.85% on two datasets, and obtains competitive results on the third. We also study the robustness of our method to noise in the data and demonstrate that our model is less sensitive to noisy data. Lastly, we perform a variety of experiments, including ablation studies, to evaluate the contribution of the different components and design choices in our solution.
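The channel-wise fusion step can be illustrated with a minimal, standard-library-only sketch. It is not the MSGazeNet implementation. It assumes three streams (raw eye image, eyeball mask, iris mask), replaces the wide residual encoders with a hypothetical `toy_encoder`, and assumes the regression head outputs two values (for example pitch and yaw). All function names, shapes, and weights here are illustrative.

```python
import random

def toy_encoder(image, out_channels, seed):
    """Hypothetical stand-in for one stream's encoder.

    Maps an H x W image to `out_channels` feature maps of the same size
    by scaling pixels with fixed random weights. The encoders described
    in the text use wide residual blocks; this is only a placeholder.
    """
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(out_channels)]
    return [[[w * px for px in row] for row in image] for w in weights]

def fuse_channels(*feature_maps):
    """Concatenate C x H x W feature maps along the channel axis."""
    height = len(feature_maps[0][0])
    width = len(feature_maps[0][0][0])
    fused = []
    for fmap in feature_maps:
        for channel in fmap:
            if len(channel) != height or len(channel[0]) != width:
                raise ValueError("all streams must share spatial size")
            fused.append(channel)
    return fused

def regress_gaze(fused, weights, bias):
    """Global-average-pool each channel, then apply one linear layer.

    The two outputs are assumed to be gaze angles (pitch, yaw).
    """
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fused]
    return [sum(w * p for w, p in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]

# Toy 4 x 6 inputs: a raw eye image plus binary eyeball and iris masks.
H, W = 4, 6
eye = [[(r * W + c) / (H * W) for c in range(W)] for r in range(H)]
eyeball_mask = [[1.0 if 1 <= c <= 4 else 0.0 for c in range(W)]
                for r in range(H)]
iris_mask = [[1.0 if 2 <= c <= 3 and 1 <= r <= 2 else 0.0 for c in range(W)]
             for r in range(H)]

# One encoder per stream, then fusion along the channel dimension.
streams = [toy_encoder(x, out_channels=8, seed=i)
           for i, x in enumerate((eye, eyeball_mask, iris_mask))]
fused = fuse_channels(*streams)

# Untrained, randomly initialised regression head (illustrative only).
rng = random.Random(0)
head_weights = [[rng.uniform(-0.1, 0.1) for _ in fused] for _ in range(2)]
head_bias = [0.0, 0.0]
gaze = regress_gaze(fused, head_weights, head_bias)

print(len(fused), len(fused[0]), len(fused[0][0]))  # prints: 24 4 6
```

Concatenating along the channel axis keeps each stream's features intact and leaves it to the regression layers to weight image and mask evidence. It requires only that all streams share the same spatial size, which `fuse_channels` checks.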