In this letter, we propose Multi-Clue Gaze (MCGaze), a new method that facilitates video gaze estimation by capturing the spatial-temporal interaction context among the head, face, and eyes in an end-to-end learning manner, an aspect that has not yet been well studied. The main advantage of MCGaze is that the clue localization tasks for the head, face, and eyes are solved jointly with gaze estimation in a single step, with joint optimization to seek optimal performance. During this process, spatial-temporal context is exchanged among the head, face, and eye clues. Accordingly, the final gaze, obtained by fusing features from the different queries, is simultaneously aware of global clues from the head and face and local clues from the eyes, which essentially improves performance. Meanwhile, the one-step design also ensures high running efficiency. Experiments on the challenging Gaze360 dataset verify the superiority of the proposed method. The source code will be released at https://github.com/zgchen33/MCGaze.