Long video understanding remains challenging for Multi-modal Large Language Models (MLLMs) due to high memory costs and context-length limits. Prior approaches mitigate this by scoring and selecting frames/tokens within short clips, but they lack a principled mechanism to (i) compare relevance across distant video clips and (ii) stop processing once sufficient evidence has been gathered. We propose AdaptToken, a training-free framework that turns an MLLM's self-uncertainty into a global control signal for long-video token selection. AdaptToken splits a video into groups, extracts cross-modal attention to rank tokens within each group, and uses the model's response entropy to estimate each group's prompt relevance. This entropy signal enables a global token budget allocation across groups and further supports early stopping (AdaptToken-Lite), skipping the remaining groups when the model becomes sufficiently certain. Across four long-video benchmarks (VideoMME, LongVideoBench, LVBench, and MLVU) and multiple base MLLMs (7B-72B), AdaptToken consistently improves accuracy (e.g., +6.7 on average over Qwen2.5-VL 7B) and continues to benefit from extremely long inputs (up to 10K frames), while AdaptToken-Lite reduces inference time by about half with comparable performance. Project page: https://haozheqi.github.io/adapt-token
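The control flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the softmax-over-negative-entropy allocation rule, the `temperature` and `stop_entropy` parameters, and all function names are assumptions made for illustration. It also assumes that per-group cross-modal attention scores and answer probabilities have already been obtained from the MLLM, and that lower response entropy means higher prompt relevance.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution over answers."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def allocate_budget(entropies, total_budget, temperature=1.0):
    """Split a global token budget across groups.

    Assumed rule (not from the paper): share is proportional to
    exp(-entropy / temperature), so more certain groups get more tokens.
    """
    weights = [math.exp(-h / temperature) for h in entropies]
    z = sum(weights)
    budgets = [int(total_budget * w / z) for w in weights]
    # Rounding down can leave tokens unassigned; give them to the most certain group.
    budgets[entropies.index(min(entropies))] += total_budget - sum(budgets)
    return budgets


def select_tokens(attn_scores, k):
    """Indices of the k highest-attention tokens, returned in original order."""
    ranked = sorted(range(len(attn_scores)), key=lambda i: attn_scores[i], reverse=True)
    return sorted(ranked[:k])


def adapt_select(groups, total_budget, stop_entropy=None):
    """groups: list of (attn_scores, answer_probs) pairs in processing order.

    If stop_entropy is set, processing stops after the first group whose
    response entropy falls below it (the early-stopping variant).
    """
    seen = []
    for attn, probs in groups:
        h = entropy(probs)
        seen.append((attn, h))
        if stop_entropy is not None and h < stop_entropy:
            break  # model is certain enough; skip the remaining groups
    if not seen:
        return []
    budgets = allocate_budget([h for _, h in seen], total_budget)
    # A group cannot contribute more tokens than it contains.
    return [select_tokens(attn, min(b, len(attn))) for (attn, _), b in zip(seen, budgets)]


# Toy input: three groups of four tokens each, with made-up scores.
groups = [
    ([0.1, 0.9, 0.3, 0.5], [0.25, 0.25, 0.25, 0.25]),  # uncertain response
    ([0.8, 0.2, 0.6, 0.4], [0.97, 0.01, 0.01, 0.01]),  # confident response
    ([0.3, 0.3, 0.2, 0.1], [0.40, 0.30, 0.20, 0.10]),
]
full = adapt_select(groups, total_budget=6)
lite = adapt_select(groups, total_budget=6, stop_entropy=0.5)
```

In this toy run, `full` covers all three groups, while `lite` stops after the second group because its response entropy is below the hypothetical threshold of 0.5. Because each group's share is capped at its own size, the number of selected tokens can be smaller than `total_budget`.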