Viewport prediction is a crucial component of tile-based 360-degree video streaming systems. However, existing trajectory-based methods lack robustness and oversimplify the construction and fusion of information across input modalities, which leads to error accumulation. In this paper, we propose MFTR, a tile-classification-based viewport prediction method built on a Multi-modal Fusion Transformer. Specifically, MFTR uses transformer-based networks to extract long-range dependencies within each modality, then mines intra- and inter-modality relations to capture the combined impact of the user's historical inputs and the video content on future viewport selection. In addition, MFTR classifies each future tile as either of interest to the user or not, and selects as the future viewport the region that contains the most tiles of interest. Compared with predicting head trajectories, choosing the future viewport from binary tile classification results is more robust and more interpretable. To evaluate MFTR, we conduct extensive experiments on two widely used datasets, PVS-HM and Xu-Gaze. MFTR outperforms state-of-the-art methods in average prediction accuracy and overlap ratio, and achieves competitive computational efficiency.
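The viewport selection step described above (classify each tile, then pick the region covering the most tiles of interest) can be illustrated with a minimal sketch. This is not the MFTR implementation: the function names `classify_tiles` and `select_viewport`, the 0.5 threshold, the fixed rectangular viewport measured in tiles, and the horizontal wrap-around are all assumptions made for illustration, and the per-tile scores stand in for whatever the network would output.

```python
from typing import List, Tuple


def classify_tiles(scores: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Binarize per-tile scores: 1 = of interest to the user, 0 = not.

    The 0.5 threshold is an illustrative assumption, not a value from the paper.
    """
    return [[1 if s >= threshold else 0 for s in row] for row in scores]


def select_viewport(
    tile_labels: List[List[int]], vp_rows: int, vp_cols: int
) -> Tuple[int, int, int]:
    """Return (top_row, left_col, count) for the vp_rows x vp_cols window
    that covers the most tiles labelled 1.

    Columns wrap around horizontally (assumption: the frame is a 360-degree
    panorama); rows do not wrap. Ties go to the first window found when
    scanning top-to-bottom, then left-to-right.
    """
    n_rows = len(tile_labels)
    n_cols = len(tile_labels[0])
    if not (1 <= vp_rows <= n_rows and 1 <= vp_cols <= n_cols):
        raise ValueError("viewport must fit inside the tile grid")

    best = (0, 0, -1)
    for top in range(n_rows - vp_rows + 1):
        for left in range(n_cols):
            # Count tiles of interest inside this candidate window.
            count = sum(
                tile_labels[top + dr][(left + dc) % n_cols]
                for dr in range(vp_rows)
                for dc in range(vp_cols)
            )
            if count > best[2]:
                best = (top, left, count)
    return best


if __name__ == "__main__":
    # Hypothetical 4 x 6 grid of per-tile scores.
    scores = [
        [0.1, 0.2, 0.0, 0.1, 0.0, 0.3],
        [0.9, 0.8, 0.2, 0.1, 0.4, 0.7],
        [0.6, 0.3, 0.1, 0.0, 0.2, 0.9],
        [0.0, 0.1, 0.0, 0.2, 0.1, 0.0],
    ]
    labels = classify_tiles(scores)
    print(select_viewport(labels, vp_rows=2, vp_cols=3))
```

Because the output is a region chosen from many per-tile decisions, a single misclassified tile shifts the count by one rather than displacing the whole viewport, which is one way to read the robustness claim made for classification over trajectory regression.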