Appearance-based gaze estimation has been actively studied in recent years. However, generalization to unseen head poses remains a significant limitation of existing methods. This work proposes a generalizable multi-view gaze estimation task and a cross-view feature fusion method to address this issue. In addition to paired images, our method takes the relative rotation matrix between the two cameras as an additional input. The proposed network learns to extract a rotatable feature representation by using the relative rotation as a constraint, and adaptively fuses the rotatable features via stacked fusion modules. This simple yet efficient approach significantly improves generalization under unseen head poses without a significant increase in computational cost. The model can be trained with random combinations of cameras, without requiring a fixed camera arrangement, and can generalize to unseen camera pairs during inference. Through experiments on multiple datasets, we demonstrate the advantage of the proposed method over baseline methods, including state-of-the-art domain generalization approaches.
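The role of the relative rotation matrix can be illustrated with a minimal sketch. This is not the network described above: it only shows how a relative rotation between two cameras could be computed and applied to a feature so that it is expressed in the other camera's coordinate frame. Everything here is an assumption made for illustration: the helper names (`rot_y`, `relative_rotation`, `rotate_feature`), the convention that each camera rotation maps world coordinates to camera coordinates, and the treatment of a "rotatable feature" as a 3 x D matrix whose columns are 3D vectors.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


def rot_y(angle_rad):
    """Rotation about the y-axis; stands in for a camera orientation (assumption)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def relative_rotation(r_cam1, r_cam2):
    """Rotation taking camera-1 coordinates to camera-2 coordinates.

    Assumes each input maps world coordinates to that camera's coordinates,
    so x_cam2 = R2 * R1^T * x_cam1.
    """
    return matmul(r_cam2, transpose(r_cam1))


def rotate_feature(feature_3xd, r_rel):
    """Rotate every 3D column of a 3 x D feature matrix by r_rel."""
    return matmul(r_rel, feature_3xd)


if __name__ == "__main__":
    r1, r2 = rot_y(0.3), rot_y(-0.5)        # two hypothetical camera orientations
    r_rel = relative_rotation(r1, r2)
    gaze_world = [[0.0], [0.0], [-1.0]]     # a 3 x 1 "feature": one gaze direction
    gaze_cam1 = matmul(r1, gaze_world)
    # Rotating the camera-1 vector by r_rel gives the same vector in camera-2 coordinates.
    print(rotate_feature(gaze_cam1, r_rel))
    print(matmul(r2, gaze_world))
```

In this toy setting a gaze direction observed in one camera, once rotated by the relative rotation, coincides with the same direction expressed in the other camera's frame. That consistency is the kind of constraint a rotatable feature would be expected to satisfy; the learned feature extractor and the stacked fusion modules are not modeled here.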