Human gaze estimation is essential for applications such as human-computer interaction, social robotics, and assistive systems. However, achieving accurate, interpretable, and real-time performance in unconstrained environments remains challenging. Existing appearance-based methods often trade off spatial robustness, computational efficiency, and effective use of contextual information. To address this, we introduce CapStARE, a capsule-based architecture that combines a frozen ConvNeXt backbone for efficient feature extraction, capsule formation with attention-based routing for structured facial reasoning, and dual GRU decoders for lightweight sequential modeling over short-horizon observation windows. This design preserves interpretable part-whole facial relationships while improving prediction stability through local contextual consistency. Experiments show mean angular errors of 3.36° on ETH-XGaze and 2.65° on MPIIFaceGaze, with competitive generalization on Gaze360 (9.06°), all with real-time inference (<10 ms). These findings suggest that CapStARE provides a practical and robust framework for appearance-based gaze estimation in real-world interactive environments. The code and experimental results are publicly available at: https://github.com/toukapy/capsStare
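To make the attention-based routing step more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each facial part is represented by a capsule vector and that a higher-level capsule is formed as a softmax-weighted sum of those parts, with weights given by scaled dot-product agreement with a query vector. The names `attention_route`, `part_capsules`, and `query` are hypothetical, and plain Python lists stand in for the tensors a real model would use.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_route(part_capsules, query):
    """Pool part capsules into one whole capsule (illustrative assumption).

    Each part is scored by its scaled dot product with `query`; the scores
    are normalized with softmax and used as routing weights.
    """
    dim = len(query)
    scores = [
        sum(p * q for p, q in zip(part, query)) / math.sqrt(dim)
        for part in part_capsules
    ]
    weights = softmax(scores)
    # Weighted sum of the part capsules, one coordinate at a time.
    whole = [
        sum(w * part[i] for w, part in zip(weights, part_capsules))
        for i in range(dim)
    ]
    return whole, weights


# Hypothetical 2-D capsules for three facial parts.
parts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
whole, weights = attention_route(parts, query=[1.0, 0.0])
```

In this toy setting the routing weights sum to one and can be read off directly, which hints at how a part-whole structure of this kind could remain inspectable. The actual capsule formation and routing in CapStARE may differ.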
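The figures reported for ETH-XGaze, MPIIFaceGaze, and Gaze360 are most naturally read as mean angular errors in degrees, the customary metric for these benchmarks. As a worked illustration, the sketch below computes the angle between a predicted and a ground-truth gaze direction. The pitch/yaw-to-vector convention is one common choice and is an assumption here; individual datasets and the CapStARE code may use a different axis convention. The function names are hypothetical.

```python
import math


def pitch_yaw_to_vector(pitch, yaw):
    """Convert pitch and yaw (radians) to a unit 3-D gaze vector.

    Assumed convention: the camera looks along -z, so zero pitch and
    zero yaw map to (0, 0, -1).
    """
    return (
        -math.cos(pitch) * math.sin(yaw),
        -math.sin(pitch),
        -math.cos(pitch) * math.cos(yaw),
    )


def angular_error_deg(pred, target):
    """Angle in degrees between two gaze directions given as (pitch, yaw)."""
    p = pitch_yaw_to_vector(*pred)
    t = pitch_yaw_to_vector(*target)
    cosine = sum(a * b for a, b in zip(p, t))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, cosine))
    return math.degrees(math.acos(cosine))


# A prediction that is off by 5 degrees of yaw gives an error of about 5 degrees.
error = angular_error_deg((0.0, 0.0), (0.0, math.radians(5.0)))
```

Averaging this quantity over all test samples yields a single per-dataset number of the kind quoted in the abstract.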