Ultra-high-resolution 360-degree video streaming is severely constrained by the massive bandwidth required to deliver immersive experiences. Current viewport prediction techniques rely predominantly on kinematics or low-level visual saliency, treating users as passive physical objects governed by inertia. This modeling limitation leads to the "Saccade Trap": a critical failure mode in which predictors fail to anticipate rapid, meaning-driven shifts in attention, causing rebuffering stalls exactly when user engagement is highest. To resolve this, we propose Semantically-Adaptive Conformal Tiling with Associative Lookahead, a novel framework that integrates cognitive intent into network control. Unlike "one-size-fits-all" approaches, our method uses an architectural inversion strategy: heavy semantic reasoning is offloaded to the server to generate lightweight association graphs, which guide a low-latency client-side controller. We construct a personalized Multi-Modal Prediction Set that dynamically tightens safety margins during stable fixation to maximize efficiency, while simultaneously pre-fetching non-adjacent tiles that contain semantically linked objects (Associative Lookahead). This mechanism effectively converts the "calm" of fixation into a preparation phase for the next interaction. Trace-driven evaluation on the 360-AV-HM dataset demonstrates that this approach mitigates the Saccade Trap, reducing stall duration by $\ge$ 20% and lowering effective bandwidth consumption by $\ge$ 18% compared to state-of-the-art trajectory-based baselines.
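A minimal sketch of this client-side control loop, assuming a per-tile nonconformity score from the base predictor, a calibrated conformal margin, and a server-supplied association graph; all identifiers and thresholds (`select_tiles`, `base_margin`, `fixation_speed`, `tighten_factor`) are illustrative assumptions rather than the actual system implementation.

```python
# Illustrative sketch (not the deployed controller): select tiles by
# (1) tightening the conformal safety margin while head motion is stable (fixation), and
# (2) adding non-adjacent tiles linked by a server-supplied association graph
#     (Associative Lookahead). All names and thresholds are assumptions.

from typing import Dict, List, Set

def select_tiles(
    predicted_tile: int,                       # most likely viewport tile from the base predictor
    tile_scores: Dict[int, float],             # nonconformity-style score per tile (lower = more plausible)
    association_graph: Dict[int, List[int]],   # server-built graph: tile -> semantically linked tiles
    head_speed_deg_s: float,                   # recent angular head speed
    base_margin: float = 0.5,                  # calibrated conformal threshold (assumed)
    fixation_speed: float = 5.0,               # below this speed the user is treated as fixating (assumed)
    tighten_factor: float = 0.6,               # margin shrink factor during fixation (assumed)
) -> Set[int]:
    """Return the set of tiles to fetch at high quality for the next segment."""
    # 1. Adapt the prediction-set width: during fixation spatial uncertainty is low,
    #    so the margin tightens and fewer neighboring tiles are included.
    margin = base_margin * (tighten_factor if head_speed_deg_s < fixation_speed else 1.0)
    prediction_set = {t for t, s in tile_scores.items() if s <= margin}
    prediction_set.add(predicted_tile)

    # 2. Associative Lookahead: spend the bandwidth saved during fixation on tiles that
    #    hold semantically linked objects, even when they are not spatially adjacent.
    for tile in list(prediction_set):
        prediction_set.update(association_graph.get(tile, []))
    return prediction_set

# Example: during fixation the set stays tight but pulls in tile 42,
# which the association graph links to the currently watched tile.
tiles = select_tiles(
    predicted_tile=7,
    tile_scores={6: 0.2, 7: 0.0, 8: 0.35, 12: 0.7},
    association_graph={7: [42]},
    head_speed_deg_s=2.0,
)
print(sorted(tiles))  # [6, 7, 42]
```

The key design choice in this sketch is that the safety margin, not the predictor itself, adapts: during stable fixation the prediction set shrinks, and the freed bandwidth is redirected to semantically linked, possibly non-adjacent tiles in anticipation of the next saccade.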