Vision-Language Models (VLMs) face significant computational challenges in video processing due to massive data redundancy, which creates prohibitively long token sequences. To address this, we introduce Triage, a training-free, plug-and-play framework that reframes video reasoning as a resource allocation problem via hierarchical visual budgeting. Its first stage, Frame-Level Budgeting, identifies keyframes by evaluating their visual dynamics and relevance, generating a strategic prior based on their importance scores. Guided by this prior, the second stage, Token-Level Budgeting, allocates tokens in two phases: it first secures high-relevance Core Tokens, followed by diverse Context Tokens selected with an efficient batched Maximal Marginal Relevance (MMR) algorithm. Extensive experiments demonstrate that Triage improves inference speed and reduces memory footprint, while maintaining or surpassing the performance of baselines and other methods on various video reasoning benchmarks.
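The Context Token phase uses Maximal Marginal Relevance, which greedily picks items that are relevant to the query yet dissimilar to items already chosen. A minimal sketch of plain greedy MMR selection follows; the function name, the use of cosine similarity, and the trade-off weight `lam` are illustrative assumptions, and this is the standard sequential form rather than the paper's batched variant:

```python
import numpy as np

def mmr_select(features, relevance, k, lam=0.7):
    """Greedy Maximal Marginal Relevance (illustrative sketch).

    Picks k items balancing relevance to the query against redundancy
    with already-selected items. Features are L2-normalized so dot
    products act as cosine similarities.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    selected = [int(np.argmax(relevance))]  # seed with the most relevant item
    candidates = set(range(len(relevance))) - set(selected)
    while len(selected) < k and candidates:
        cand = np.array(sorted(candidates))
        # similarity of each candidate to its closest already-selected item
        sim_to_sel = (feats[cand] @ feats[selected].T).max(axis=1)
        scores = lam * relevance[cand] - (1 - lam) * sim_to_sel
        best = int(cand[np.argmax(scores)])
        selected.append(best)
        candidates.remove(best)
    return selected

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 16))  # stand-in for token embeddings
rel = rng.random(50)               # stand-in for relevance scores
picked = mmr_select(feats, rel, k=8)
```

With `lam` near 1 the selection favors pure relevance; lowering it trades relevance for diversity among the kept Context Tokens.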