This paper tackles the problem of passive gaze estimation using both event and frame data. Because physiological structures differ inherently across individuals, it is intractable to accurately estimate gaze purely from a single given state. We therefore reformulate gaze estimation as quantifying the state shift from the current state to several previously registered anchor states. Specifically, we propose a two-stage learning-based gaze estimation framework that decomposes the whole process into a coarse-to-fine pipeline of anchor state selection followed by final gaze localization. Moreover, to improve generalization, instead of directly training a single large gaze estimation network, we align a group of local experts with a student network, introducing a novel denoising distillation algorithm that leverages denoising diffusion techniques to iteratively remove the inherent noise in event data. Extensive experiments demonstrate the effectiveness of the proposed method, which surpasses state-of-the-art methods by a large margin of 15%. The code will be publicly available at https://github.com/jdjdli/Denoise_distill_EF_gazetracker.