Recent advances in Multimodal Large Language Models (MLLMs) have improved image recognition and reasoning, but video-related tasks remain challenging due to memory constraints from dense frame processing. Existing Video Moment Retrieval (VMR) methodologies rely on sparse frame sampling, risking information loss, especially in lengthy videos. We propose SMORE (See MORE, store less), a framework that improves memory efficiency while maintaining high information resolution. SMORE (1) uses query-guided captions to encode semantics aligned with user intent, (2) applies query-aware importance modulation to highlight relevant segments, and (3) adaptively compresses frames to preserve key content while reducing redundancy. This enables efficient video understanding without exceeding memory budgets. Experiments show that SMORE achieves state-of-the-art performance on the QVHighlights, Charades-STA, and ActivityNet-Captions benchmarks.
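The interplay of query-aware importance modulation and adaptive compression can be illustrated with a toy sketch. Everything below is hypothetical: word-overlap scoring stands in for the learned query-frame relevance model, and the budgeted token allocation stands in for the actual compression scheme, neither of which is specified here.

```python
def importance_scores(query, captions):
    """Toy query-aware importance: score each frame caption by its
    word overlap with the user query (a stand-in for a learned model)."""
    q = set(query.lower().split())
    return [len(q & set(c.lower().split())) / max(len(q), 1) for c in captions]

def allocate_tokens(scores, budget):
    """Toy adaptive compression: split a fixed token budget across frames
    in proportion to importance, so query-relevant frames keep more detail
    while every frame retains at least one token."""
    total = sum(scores) or 1.0
    alloc = [max(1, round(budget * s / total)) for s in scores]
    # Trim any rounding overshoot from the least important frames first.
    while sum(alloc) > budget:
        shrinkable = [j for j in range(len(alloc)) if alloc[j] > 1]
        if not shrinkable:
            break
        alloc[min(shrinkable, key=lambda j: scores[j])] -= 1
    return alloc

captions = [
    "a man opens the fridge",
    "a dog runs across the yard",
    "the man pours milk into a glass",
]
scores = importance_scores("man pours milk", captions)
print(allocate_tokens(scores, budget=12))  # most tokens go to the third frame
```

Under this allocation, the frame whose caption best matches the query keeps the most tokens, while irrelevant frames are compressed to a minimal footprint, which is the qualitative behavior the abstract describes.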