Video large language models (VideoLLMs) show strong video understanding capability, yet long-context inference is still dominated by massively redundant visual tokens in the prefill stage. We revisit token compression for VideoLLMs under a tight budget and identify a key bottleneck: insufficient spatio-temporal information coverage. Existing methods often introduce discontinuous coverage through coarse per-frame allocation or scene segmentation, and token merging can further misalign spatio-temporal coordinates under MRoPE-style discrete (t, h, w) bindings. To address these issues, we propose V-CAST (Video Curvature-Aware Spatio-Temporal Pruning), a training-free, plug-and-play pruning policy for long-context video inference. V-CAST casts token compression as a trajectory approximation problem and introduces a curvature-guided temporal allocation module that routes per-frame token budgets to semantic turns and event boundaries. It further adopts a dual-anchor spatial selection mechanism that preserves high-entropy visual evidence without attention intervention, while keeping retained tokens at their original coordinates to maintain positional alignment. Extensive experiments across multiple VideoLLMs of different architectures and scales demonstrate that V-CAST retains 98.6% of the original performance, outperforms the second-best method by 1.1% on average, and reduces peak memory and total latency to 86.7% and 86.4%, respectively, of those of vanilla Qwen3-VL-8B-Instruct.
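The curvature-guided temporal allocation can be illustrated with a small sketch. The abstract does not give the exact curvature formula or rounding scheme, so everything below is an assumption made for illustration: curvature is approximated by the magnitude of the second difference of per-frame feature vectors, every frame keeps a small floor of tokens to preserve coverage, and the remaining budget is handed out in proportion to curvature using a highest-quotient rule. The names `discrete_curvature`, `allocate_budget`, and `min_per_frame` are hypothetical and are not taken from V-CAST.

```python
import math


def discrete_curvature(frames):
    """Approximate per-frame curvature of a feature trajectory.

    Assumption: curvature at frame t is the norm of the second difference
    f[t-1] - 2*f[t] + f[t+1]; the two endpoints copy their neighbour.
    """
    n = len(frames)
    if n < 3:
        return [0.0] * n
    curv = [0.0] * n
    for t in range(1, n - 1):
        curv[t] = math.sqrt(sum(
            (a - 2.0 * b + c) ** 2
            for a, b, c in zip(frames[t - 1], frames[t], frames[t + 1])
        ))
    curv[0], curv[-1] = curv[1], curv[-2]
    return curv


def allocate_budget(frames, total_budget, tokens_per_frame, min_per_frame=1):
    """Split a total token budget across frames, favouring high curvature.

    Hypothetical policy: each frame first receives `min_per_frame` tokens,
    then spare tokens go one at a time to the frame with the highest
    curvature-per-extra-token quotient, never exceeding `tokens_per_frame`.
    """
    n = len(frames)
    if not min_per_frame * n <= total_budget <= tokens_per_frame * n:
        raise ValueError("budget must lie between the floor and the full token count")

    curv = discrete_curvature(frames)
    if sum(curv) == 0.0:
        curv = [1.0] * n  # flat trajectory: fall back to uniform weights

    alloc = [min_per_frame] * n
    extra = [0] * n
    for _ in range(total_budget - min_per_frame * n):
        candidates = [t for t in range(n) if alloc[t] < tokens_per_frame]
        # Highest quotient wins; ties go to the frame holding fewer tokens.
        best = max(candidates, key=lambda t: (curv[t] / (extra[t] + 1), -alloc[t]))
        alloc[best] += 1
        extra[best] += 1
    return alloc


# Toy trajectory: straight motion, one sharp turn at frame 2, straight again.
toy_frames = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (2.0, 2.0)]
toy_alloc = allocate_budget(toy_frames, total_budget=10, tokens_per_frame=8)
print(toy_alloc)
```

In this toy trajectory only the turning frame has non-zero curvature, so it absorbs the spare budget while the straight segments keep the floor. A spatial selection step would then choose which tokens to keep inside each frame; consistent with the abstract, the kept tokens would retain their original (t, h, w) position indices rather than being merged or re-indexed.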