Long-video understanding remains challenging for multimodal large language models, because temporally extended videos often contain thousands of frames and are therefore expensive to process exhaustively. Existing methods usually construct compact visual inputs from long videos under a limited visual budget. However, most of them still follow a frame-centric paradigm and apply similar representations to retained content regardless of its importance. This makes it difficult to preserve both high-fidelity visual evidence and broad temporal coverage. To address this issue, we propose Q-Fold, a training-free input construction framework for long-video understanding. Instead of treating isolated frames as the basic modeling unit, Q-Fold operates on contiguous temporal segments and constructs a heterogeneous Focus--Context representation under query guidance. Query-relevant segments are preserved as high-fidelity Focus Frames, while less relevant segments are folded into chronology-preserving contextual layouts. In this way, Q-Fold preserves critical visual evidence and broad temporal coverage, while better maintaining local temporal continuity within short segments. Experiments on four long-video benchmarks with multiple Video-MLLMs show that Q-Fold consistently improves performance without increasing the input budget. Notably, it achieves gains of up to 9.1 percentage points on an ultra-long video benchmark. Code will be made publicly available.
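To make the Focus--Context idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a per-frame query-relevance score is already available (for example, from some text-image similarity model), that segments have a fixed length, and that segment relevance is the mean of its frame scores. The names `Unit`, `build_focus_context`, `seg_len`, and `num_focus` are hypothetical and do not come from the paper. Each returned unit stands for one visual input slot: either a single high-fidelity frame or one folded layout that holds a whole segment in chronological order.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Unit:
    kind: str          # "focus": one full-resolution frame; "context": one folded layout
    frames: List[int]  # frame indices covered by this unit, in chronological order


def build_focus_context(scores: Sequence[float], seg_len: int, num_focus: int) -> List[Unit]:
    """Illustrative only: split frames into contiguous segments, keep the
    top-scoring segments frame by frame, and fold every other segment."""
    n = len(scores)
    # Contiguous temporal segments of (at most) seg_len frames each.
    segments = [list(range(s, min(s + seg_len, n))) for s in range(0, n, seg_len)]
    # Assumption: segment relevance is the mean per-frame relevance.
    seg_scores = [sum(scores[i] for i in seg) / len(seg) for seg in segments]
    ranked = sorted(range(len(segments)), key=lambda k: seg_scores[k], reverse=True)
    focus_ids = set(ranked[:num_focus])

    units: List[Unit] = []
    for k, seg in enumerate(segments):  # walk segments in time order
        if k in focus_ids:
            # Query-relevant segment: every frame is kept as its own unit.
            units.extend(Unit("focus", [i]) for i in seg)
        else:
            # Less relevant segment: all frames share one folded layout.
            units.append(Unit("context", seg))
    return units


# Twelve frames, segments of four frames, one Focus segment.
scores = [0.1, 0.2, 0.1, 0.0, 0.9, 0.8, 0.7, 0.9, 0.2, 0.1, 0.3, 0.2]
units = build_focus_context(scores, seg_len=4, num_focus=1)
```

In this toy run the middle segment is kept as four Focus units and the two outer segments become one Context unit each, so all twelve frames are covered with six units. How a folded layout is actually rendered (grid shape, resolution) and how the budget is split between the two unit types are left open here.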