Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits scalability. Prior systems reduce inference cost by exploiting temporal and spatial redundancy in video streams, but they target either the vision transformer (ViT) or the large language model (LLM) in isolation, leaving end-to-end opportunities untapped. Moreover, existing methods incur significant overhead to identify redundancy, either through offline profiling and training or through costly online computation, making them ill-suited for dynamic real-time streams. We present CoStream, a codec-guided streaming video analytics system built on the key observation that video codecs already extract the temporal and spatial structure of each stream as a byproduct of compression. CoStream treats this codec metadata as a low-cost runtime signal to unify optimization across video decoding, visual processing, and LLM prefilling, with transmission reduction as an inherent benefit of operating directly on compressed bitstreams. This signal drives codec-guided patch pruning before ViT encoding and selective key-value cache refresh during LLM prefilling, both of which are fully online and require no offline training. Experiments show that CoStream achieves up to 3× throughput improvement and up to 87% GPU compute reduction over state-of-the-art baselines, while maintaining competitive accuracy with an F1 drop of only 0-8%.
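To make the idea of codec-guided patch pruning concrete, the following is a minimal sketch, not CoStream's actual implementation. It assumes the codec metadata has already been reduced to two per-patch grids, a motion-vector magnitude and a residual energy, and that a patch is re-encoded by the ViT only when either value exceeds a threshold. The function name `select_patches`, both thresholds, and the choice of these two signals are hypothetical; the abstract does not say which codec fields the system reads.

```python
from typing import List, Tuple


def select_patches(
    motion: List[List[float]],
    residual: List[List[float]],
    is_keyframe: bool,
    mv_thresh: float = 1.0,   # hypothetical threshold on motion magnitude
    res_thresh: float = 0.1,  # hypothetical threshold on residual energy
) -> List[Tuple[int, int]]:
    """Return (row, col) indices of patches to pass to the ViT.

    Assumption: `motion` and `residual` are same-shaped grids derived from
    codec metadata, with one value per ViT patch.
    """
    kept: List[Tuple[int, int]] = []
    for r, (motion_row, residual_row) in enumerate(zip(motion, residual)):
        for c, (m, e) in enumerate(zip(motion_row, residual_row)):
            # A keyframe carries no inter-frame prediction, so keep everything.
            # Otherwise keep only patches the codec marked as changed.
            if is_keyframe or m > mv_thresh or e > res_thresh:
                kept.append((r, c))
    return kept


# A 2x2 patch grid: one patch moved, one has a large residual.
motion = [[0.0, 2.5], [0.0, 0.0]]
residual = [[0.0, 0.0], [0.3, 0.0]]
changed = select_patches(motion, residual, is_keyframe=False)
```

In this toy grid only two of the four patches would reach the ViT on a predicted frame, while a keyframe would pass all four. A real system would also need to map codec block sizes onto the ViT patch grid, which this sketch leaves out.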