Predicting human scanpaths when exploring panoramic videos is a challenging task due to the spherical geometry and the multimodality of the input, and the inherent uncertainty and diversity of the output. Most previous methods fail to give a complete treatment of these characteristics, and thus are prone to errors. In this paper, we present a simple new criterion for scanpath prediction based on principles from lossy data compression. This criterion suggests minimizing the expected code length of quantized scanpaths in a training set, which corresponds to fitting a discrete conditional probability model via maximum likelihood. Specifically, the probability model is conditioned on two modalities: a viewport sequence as the deformation-reduced visual input and a set of relative historical scanpaths projected onto respective viewports as the aligned path input. The probability model is parameterized by a product of discretized Gaussian mixture models to capture the uncertainty and the diversity of scanpaths from different users. Most importantly, the training of the probability model does not rely on the specification of "ground-truth" scanpaths for imitation learning. We also introduce a proportional-integral-derivative (PID) controller-based sampler to generate realistic human-like scanpaths from the learned probability model. Experimental results demonstrate that our method consistently produces better quantitative scanpath results in terms of prediction accuracy (by comparing to the assumed "ground-truths") and perceptual realism (through machine discrimination) over a wide range of prediction horizons. We additionally verify the perceptual realism improvement via a formal psychophysical experiment and the generalization improvement on several unseen panoramic video datasets.
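The link between expected code length and maximum likelihood stated above can be made concrete with a minimal sketch. It assumes a unit quantization bin, a one-dimensional coordinate per time step, and fixed mixture parameters standing in for what a conditional network would output; the function names (`normal_cdf`, `discretized_gmm_pmf`, `code_length_bits`) are hypothetical and not taken from the paper.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def discretized_gmm_pmf(k, weights, means, sigmas, bin_width=1.0):
    """Probability mass that a Gaussian mixture assigns to quantization bin k.

    Bin k covers [(k - 0.5) * bin_width, (k + 0.5) * bin_width), so the mass
    is the mixture CDF evaluated at the two bin edges and differenced.
    """
    lo = (k - 0.5) * bin_width
    hi = (k + 0.5) * bin_width
    return sum(
        w * (normal_cdf(hi, m, s) - normal_cdf(lo, m, s))
        for w, m, s in zip(weights, means, sigmas)
    )


def code_length_bits(symbols, params, eps=1e-12):
    """Total code length (in bits) of a quantized sequence.

    `symbols` is a list of integer bin indices; `params` is a list of
    (weights, means, sigmas) tuples, one per symbol. Because the model is a
    product of per-symbol distributions, the code length is a sum of
    -log2 p terms, i.e. the negative log-likelihood in base 2.
    """
    total = 0.0
    for k, (w, m, s) in zip(symbols, params):
        p = discretized_gmm_pmf(k, w, m, s)
        total += -math.log2(max(p, eps))  # eps guards against log(0)
    return total


# Illustrative use: the same quantized path coded under two candidate models.
path = [0, 1, 1, 2]
well_fit = [([0.5, 0.5], [0.0, 1.5], [1.0, 1.0])] * len(path)
poor_fit = [([0.5, 0.5], [8.0, 9.0], [1.0, 1.0])] * len(path)
print(code_length_bits(path, well_fit), code_length_bits(path, poor_fit))
```

Under these assumptions, a model that places more mass on the observed bins yields a shorter code, so minimizing the average code length over a training set is the same optimization as maximizing the likelihood of the quantized scanpaths.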