Gaze prediction plays a critical role in Virtual Reality (VR) applications: it compensates for sensor-induced latency and enables computationally demanding techniques, such as foveated rendering, that depend on anticipating user attention. However, direct eye tracking is often unavailable due to hardware limitations or privacy concerns. To address this, we present a novel gaze prediction framework that combines Head-Mounted Display (HMD) motion signals with visual saliency cues derived from video frames. Our method employs UniSal, a lightweight saliency encoder, to extract visual features, which are then fused with HMD motion data and processed by a time-series prediction module. We evaluate two lightweight architectures, TSMixer and LSTM, for forecasting future gaze directions. Experiments on the EHTask dataset, along with deployment on commercial VR hardware, show that our approach consistently outperforms baselines such as Center-of-HMD and Mean Gaze. These results demonstrate the effectiveness of predictive gaze modeling in reducing perceptual lag and enhancing natural interaction in VR environments where direct eye tracking is constrained.
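The pipeline outlined above can be illustrated with a minimal sketch: per-frame HMD motion is fused with a pooled saliency feature, a forecaster predicts the next gaze direction, and its error is compared against the Center-of-HMD baseline. This is not the paper's implementation; a constant-velocity extrapolator stands in for the TSMixer/LSTM module, and all function names and the toy data are hypothetical.

```python
import math

def fuse_features(hmd_yaw_pitch, saliency_vec):
    """Concatenate HMD motion (yaw, pitch in degrees) with a pooled
    saliency feature vector into one fused frame descriptor."""
    return list(hmd_yaw_pitch) + list(saliency_vec)

def center_of_hmd_baseline(hmd_yaw_pitch):
    """Baseline: predict gaze at the HMD's optical center."""
    return tuple(hmd_yaw_pitch)

def constant_velocity_forecast(gaze_history, horizon=1):
    """Stand-in forecaster: extrapolate the last observed gaze
    velocity `horizon` steps ahead (replaces TSMixer/LSTM here)."""
    (y0, p0), (y1, p1) = gaze_history[-2], gaze_history[-1]
    return (y1 + horizon * (y1 - y0), p1 + horizon * (p1 - p0))

def angular_error(pred, true):
    """Error in yaw/pitch degrees (small-angle approximation)."""
    return math.hypot(pred[0] - true[0], pred[1] - true[1])

# Toy sequence: gaze drifts right at 1 deg/frame while the HMD lags behind.
gaze_history = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
hmd_pose = (0.5, 0.0)
saliency = [0.1, 0.7, 0.2]          # pooled saliency feature (illustrative)
true_next_gaze = (3.0, 0.0)

fused = fuse_features(hmd_pose, saliency)
pred = constant_velocity_forecast(gaze_history)
base = center_of_hmd_baseline(hmd_pose)

print(angular_error(pred, true_next_gaze))   # 0.0  (forecaster tracks the drift)
print(angular_error(base, true_next_gaze))   # 2.5  (HMD center lags the gaze)
```

In this toy case any forecaster that models gaze dynamics beats the static Center-of-HMD baseline, which mirrors the comparison reported in the evaluation.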