Interactive $360^{\circ}$ Video Streaming Using FoV-Adaptive Coding with Temporal Prediction

For $360^{\circ}$ video streaming, FoV-adaptive coding, which allocates more bits to the user's predicted field of view (FoV), is an effective way to maximize the rendered video quality under limited bandwidth. We develop a low-latency FoV-adaptive coding and streaming system for interactive applications that is robust to bandwidth variations and FoV prediction errors. To minimize the end-to-end delay while maximizing the coding efficiency, we propose a frame-level FoV-adaptive inter-coding structure. In each frame, regions that are in or near the predicted FoV are coded using temporal and spatial prediction, while a small rotating region is coded with spatial prediction only. This rotating intra region periodically refreshes the entire frame, thereby providing robustness to both FoV prediction errors and frame losses due to transmission errors. The system adapts the sizes and rates of the different regions for each video segment to maximize the rendered video quality under the predicted bandwidth constraint. Integrating such frame-level FoV adaptation with temporal prediction is challenging because the FoV varies over time. We propose novel ways to model the influence of FoV dynamics on the quality-rate performance of temporal predictive coding. We further develop LSTM-based machine learning models to predict the user's FoV and the network bandwidth. The proposed system is compared with three benchmark systems, using real-world network bandwidth traces and FoV traces, and is shown to significantly improve the rendered video quality while achieving very low end-to-end delay and low frame-freeze probability.
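The rotating intra region described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes, purely for illustration, that the frame is split into equal-width vertical columns, that the intra region advances by its own width every frame, and that intra coding takes precedence where the intra region overlaps the predicted FoV. The names `intra_columns`, `refresh_period`, and `classify_columns` are hypothetical, and the abstract does not say how columns outside both regions are handled, so they are only labeled `"other"` here.

```python
from math import ceil


def intra_columns(frame_idx, num_columns, intra_width):
    """Columns coded with spatial prediction only in this frame.

    Assumed layout: the intra region is `intra_width` adjacent columns
    that shift by `intra_width` each frame and wrap around the frame.
    """
    start = (frame_idx * intra_width) % num_columns
    return [(start + i) % num_columns for i in range(intra_width)]


def refresh_period(num_columns, intra_width):
    """Number of consecutive frames needed to intra-refresh every column."""
    return ceil(num_columns / intra_width)


def classify_columns(frame_idx, num_columns, intra_width, fov_columns):
    """Label each column with an illustrative coding mode for one frame.

    `fov_columns` is the set of columns in or near the predicted FoV.
    Assumption: intra wins when the two regions overlap.
    """
    intra = set(intra_columns(frame_idx, num_columns, intra_width))
    modes = []
    for col in range(num_columns):
        if col in intra:
            modes.append("intra")   # spatial prediction only
        elif col in fov_columns:
            modes.append("inter")   # temporal and spatial prediction
        else:
            modes.append("other")   # handling not specified in the abstract
    return modes


if __name__ == "__main__":
    period = refresh_period(num_columns=12, intra_width=2)
    refreshed = set()
    for t in range(period):
        refreshed.update(intra_columns(t, 12, 2))
    print(period, sorted(refreshed))
    print(classify_columns(1, 12, 2, fov_columns={3, 4, 5}))
```

Under these assumptions, any column that was skipped or corrupted, whether through a wrong FoV prediction or a lost frame, is recoded without temporal reference within one refresh period, which is the robustness property the abstract attributes to the rotating intra region.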
