Vision-Language Models (VLMs) face a critical memory bottleneck when processing long-form video content because the Key-Value (KV) cache grows linearly with sequence length. Existing solutions predominantly employ reactive eviction strategies that compute full attention matrices before discarding tokens, resulting in substantial computational waste. We propose Sali-Cache, a novel a priori optimization framework that implements dual-signal adaptive caching through proactive memory management. By integrating a temporal filter based on optical flow analysis to detect inter-frame redundancy with a spatial filter leveraging saliency detection to identify visually significant regions, Sali-Cache manages memory allocation before entering the computationally expensive attention operations. Experimental evaluation on the LLaVA 1.6 architecture demonstrates that our method achieves a 2.20x compression ratio in effective memory usage while fully preserving BLEU, ROUGE-L, and Exact Match scores (100% retention). Furthermore, under identical memory budget constraints, Sali-Cache preserves context-rich features over extended temporal durations without degrading model performance, enabling efficient processing of long-form video content on consumer-grade hardware.
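The dual-signal gating described above can be illustrated with a minimal sketch. All names here (`temporal_redundant`, `saliency_mask`, `select_tokens`) and the thresholds are hypothetical, and the paper's optical-flow and saliency modules are stood in for by simple NumPy proxies: mean absolute inter-frame difference in place of optical-flow magnitude, and gradient-magnitude energy in place of a learned saliency detector.

```python
import numpy as np

def temporal_redundant(prev_frame, frame, motion_thresh=0.05):
    # Proxy for optical-flow magnitude: mean absolute inter-frame difference.
    # A frame whose motion falls below the threshold is treated as redundant.
    return np.abs(frame - prev_frame).mean() < motion_thresh

def saliency_mask(frame, keep_ratio=0.5):
    # Proxy for saliency detection: gradient-magnitude energy per pixel.
    # Keep only the top `keep_ratio` fraction of most salient positions.
    gy, gx = np.gradient(frame)
    sal = np.hypot(gx, gy)
    thresh = np.quantile(sal, 1.0 - keep_ratio)
    return sal >= thresh

def select_tokens(frames, motion_thresh=0.05, keep_ratio=0.5):
    """Per-frame boolean masks of positions admitted to the KV cache,
    decided BEFORE attention is computed (a priori, not reactive eviction)."""
    masks, prev = [], None
    for frame in frames:
        if prev is not None and temporal_redundant(prev, frame, motion_thresh):
            # Temporal filter: drop the whole frame as inter-frame redundancy.
            masks.append(np.zeros(frame.shape, dtype=bool))
        else:
            # Spatial filter: cache only visually significant regions.
            masks.append(saliency_mask(frame, keep_ratio))
        prev = frame
    return masks
```

The key design point is ordering: both filters run on cheap pixel-level signals before any attention operation, so redundant tokens never enter the quadratic-cost path, unlike reactive eviction which pays full attention cost first.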