Large vision-language models (VLMs) are enabling interactive video reasoning, giving rise to streaming long-video understanding. In this setting, frames arrive continuously, and the system must preserve long-term context while generating responses under strict latency constraints. A central challenge is KVCache management: as the video stream grows, the KVCache expands rapidly, increasing computation and memory overhead. Existing retrieval-based approaches exploit attention sparsity and offload inactive KVCache entries from GPU to CPU memory, but their token-level design incurs high management overhead and fragmented data movement. We present Mosaic, the first cluster-driven VLM inference system for streaming long-video understanding. Our key insight is that the VLM KVCache exhibits an implicit cross-modal clustering structure: retrieved KV states form groups jointly shaped by visual coherence and semantic relevance. Based on this observation, Mosaic uses cross-modal clusters as the basic unit of KVCache organization, maintenance, and retrieval. Evaluations show that Mosaic outperforms state-of-the-art baselines, achieving up to a 1.38x speedup.
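To make the contrast between token-level and cluster-level retrieval concrete, the following is a minimal sketch and not Mosaic's actual algorithm. The abstract does not say how clusters are formed or scored, so everything here is an assumption for illustration: keys are grouped online by cosine similarity to a running centroid, a query is scored against centroids only, and each selected cluster is returned as one unit. The names `ClusterKVIndex`, `Cluster`, and `sim_threshold` are hypothetical, and the sketch ignores the cross-modal aspect and GPU/CPU placement entirely.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Cluster:
    centroid: list                                  # running mean of member keys
    token_ids: list = field(default_factory=list)   # tokens stored together

    def add(self, token_id, key):
        n = len(self.token_ids)
        # Incremental mean: fold the new key into the centroid.
        self.centroid = [(c * n + k) / (n + 1) for c, k in zip(self.centroid, key)]
        self.token_ids.append(token_id)


class ClusterKVIndex:
    """Hypothetical cluster-granular index over KV entries (illustration only)."""

    def __init__(self, sim_threshold=0.8):
        # Assumed rule: join the nearest cluster if similar enough, else open a new one.
        self.sim_threshold = sim_threshold
        self.clusters = []

    def insert(self, token_id, key):
        best, best_sim = None, -1.0
        for cluster in self.clusters:
            sim = cosine(cluster.centroid, key)
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is None or best_sim < self.sim_threshold:
            best = Cluster(centroid=list(key))
            self.clusters.append(best)
        best.add(token_id, key)

    def retrieve(self, query, top_k=1):
        # Score one centroid per cluster instead of one key per token,
        # then return each selected cluster's tokens as a single group.
        ranked = sorted(self.clusters,
                        key=lambda c: cosine(c.centroid, query),
                        reverse=True)
        return [list(c.token_ids) for c in ranked[:top_k]]


# Usage: four toy 2-D keys that fall into two groups.
index = ClusterKVIndex(sim_threshold=0.8)
for token_id, key in enumerate([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]):
    index.insert(token_id, key)
print(index.retrieve([0.0, 1.0], top_k=1))
```

In this toy setup the query is compared against two centroids rather than four keys, and the selected tokens come back as one group, which is the kind of bookkeeping and data-movement saving the abstract attributes to cluster-level management. A real system would also have to decide when to split, merge, or evict clusters, which this sketch does not attempt.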