Viewport prediction is a crucial component of tile-based 360-degree video streaming systems. However, existing trajectory-based methods lack robustness and oversimplify the construction and fusion of information across input modalities, which leads to error accumulation. In this paper, we propose MFTR, a tile-classification-based viewport prediction method built on a Multi-modal Fusion Transformer. Specifically, MFTR uses transformer-based networks to extract long-range dependencies within each modality, then mines intra- and inter-modality relations to capture the combined impact of the user's historical inputs and the video content on future viewport selection. In addition, MFTR classifies each future tile as either of interest to the user or not, and selects the future viewport as the region that contains the most tiles of interest. Compared with predicting head trajectories, choosing the future viewport from binary tile classification results offers better robustness and interpretability. To evaluate MFTR, we conduct extensive experiments on two widely used datasets, PVS-HM and Xu-Gaze. MFTR outperforms state-of-the-art methods in average prediction accuracy and overlap ratio, and also shows competitive computational efficiency.
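The viewport selection step described in the abstract (classify each tile as of interest or not, then pick the region containing the most tiles of interest) can be illustrated with a minimal sketch. This is not the authors' implementation. The per-tile scores, the 0.5 threshold, the 4x6 tile grid, the 2x2 viewport size, the horizontal wrap-around, and the function names `classify_tiles` and `select_viewport` are all assumptions made for illustration.

```python
from typing import List, Tuple


def classify_tiles(scores: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Turn per-tile scores into binary labels (1 = of interest, 0 = not).

    The threshold value is an assumption; the abstract only states that
    tiles are split into two categories.
    """
    return [[1 if s >= threshold else 0 for s in row] for row in scores]


def select_viewport(
    interest: List[List[int]], vp_rows: int, vp_cols: int
) -> Tuple[Tuple[int, int], int]:
    """Return ((top, left), count) for the window covering the most tiles of interest.

    Assumes vp_rows <= number of rows and vp_cols <= number of columns.
    Columns wrap around (assumed here because a 360-degree frame is
    horizontally continuous); rows do not wrap.
    Ties are broken in favour of the first window found in scan order.
    """
    n_rows, n_cols = len(interest), len(interest[0])
    best_pos, best_count = (0, 0), -1
    for top in range(n_rows - vp_rows + 1):
        for left in range(n_cols):
            count = sum(
                interest[top + dr][(left + dc) % n_cols]
                for dr in range(vp_rows)
                for dc in range(vp_cols)
            )
            if count > best_count:
                best_pos, best_count = (top, left), count
    return best_pos, best_count


# Hypothetical scores for a 4x6 tile grid (made-up values, not from the paper).
scores = [
    [0.9, 0.1, 0.1, 0.1, 0.2, 0.8],
    [0.7, 0.2, 0.1, 0.0, 0.3, 0.9],
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.6],
    [0.0, 0.0, 0.2, 0.1, 0.0, 0.1],
]
labels = classify_tiles(scores)
position, covered = select_viewport(labels, vp_rows=2, vp_cols=2)
print(position, covered)  # (0, 5) 4
```

In this toy grid the tiles of interest straddle the left and right edges of the frame, so the best window starts in the last column and wraps around to the first. Because the selection depends on a count over many tiles, a few misclassified tiles shift the result less than an equivalent error in a single predicted head position would, which is one way to read the robustness argument in the abstract.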