Appearance-based gaze estimation has been actively studied in recent years, but generalization to unseen head poses remains a significant limitation of existing methods. This work proposes a generalizable multi-view gaze estimation task and a cross-view feature fusion method to address this issue. In addition to paired images, our method takes the relative rotation matrix between the two cameras as input. The proposed network learns to extract a rotatable feature representation by using the relative rotation as a constraint, and adaptively fuses the rotatable features via stacked fusion modules. This simple yet efficient approach significantly improves generalization to unseen head poses without a substantial increase in computational cost. The model can be trained with random combinations of cameras, without fixing their positions, and can generalize to unseen camera pairs during inference. Through experiments on multiple datasets, we demonstrate the advantage of the proposed method over baseline methods, including state-of-the-art domain generalization approaches. The code will be available at https://github.com/ut-vision/Rot-MVGaze.
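To make the role of the relative rotation matrix concrete, the following is a minimal NumPy sketch, not the released implementation. It assumes each camera is described by a world-to-camera rotation, and it assumes a "rotatable" feature can be viewed as D three-dimensional vectors (a 3 x D matrix) that are left-multiplied by the relative rotation. The helper names `rotation_about_y`, `relative_rotation`, and `rotate_feature`, the camera angles, and the feature size are all hypothetical choices made for illustration.

```python
import numpy as np


def rotation_about_y(angle_rad):
    """Return a 3x3 rotation matrix about the y-axis (illustrative camera pose)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])


def relative_rotation(R_a, R_b):
    """Rotation mapping camera-a coordinates to camera-b coordinates.

    Assumes R_a and R_b are world-to-camera rotations, i.e. x_cam = R @ x_world,
    so x_b = R_b @ R_a.T @ x_a.
    """
    return R_b @ R_a.T


def rotate_feature(feature, R):
    """Assumption: view a flat feature of length 3*D as a 3 x D matrix and rotate it."""
    return R @ feature.reshape(3, -1)


# Two hypothetical cameras looking at the same face from different angles.
R_a = rotation_about_y(np.deg2rad(-20.0))
R_b = rotation_about_y(np.deg2rad(35.0))
R_ab = relative_rotation(R_a, R_b)

# A unit gaze direction in world coordinates, expressed in each camera frame.
g_world = np.array([0.1, -0.2, -1.0])
g_world /= np.linalg.norm(g_world)
g_a = R_a @ g_world
g_b = R_b @ g_world

# The relative rotation carries the gaze vector from view a to view b.
gaze_consistent = np.allclose(R_ab @ g_a, g_b)

# A stand-in "rotatable" feature from view a, moved into view b's frame.
D = 4
rng = np.random.default_rng(0)
feature_a = rng.standard_normal(3 * D)
feature_in_b = rotate_feature(feature_a, R_ab)
```

In this sketch the same matrix that maps the gaze vector between views is applied to the feature, which is the sense in which the relative rotation can act as a constraint. How the network is trained to satisfy it, and how the stacked fusion modules combine the two views, is not specified in the abstract and is not modeled here.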