Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits its scalability. Prior systems reduce inference cost by exploiting temporal and spatial redundancy in video streams, but they target either the vision transformer (ViT) or the large language model (LLM) in isolation, leaving end-to-end optimization opportunities untapped. Moreover, existing methods incur significant overhead to identify redundancy, either through offline profiling and training or through costly online computation, making them ill-suited for dynamic real-time streams. We present CodecSight, a codec-guided streaming video analytics system built on the key observation that video codecs already extract the temporal and spatial structure of each stream as a byproduct of compression. CodecSight treats this codec metadata as a low-cost runtime signal to unify optimization across video decoding, visual processing, and LLM prefilling, with reduced transmission as an inherent benefit of operating directly on compressed bitstreams. This signal drives codec-guided patch pruning before ViT encoding and selective key-value cache refresh during LLM prefilling, both of which are fully online and require no offline training. Experiments show that CodecSight improves throughput by up to 3$\times$ and reduces GPU compute by up to 87% relative to state-of-the-art baselines, while maintaining competitive accuracy with an F1 drop of only 0 to 8%.
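To make the idea of using codec metadata as a runtime signal more concrete, the following is a minimal sketch of what codec-guided patch selection and a cache refresh plan might look like. It is not CodecSight's implementation: the per-patch metadata format (motion-vector magnitude and residual energy), the function names `select_patches` and `refresh_plan`, and both thresholds are hypothetical choices made only for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical per-patch codec metadata: (motion-vector magnitude, residual energy).
# A real decoder exposes this per macroblock or coding unit; the mapping from
# codec blocks to ViT patches is assumed to have been done already.
PatchMeta = Tuple[float, float]


def select_patches(
    meta: Sequence[PatchMeta],
    is_keyframe: bool,
    mv_thresh: float = 0.5,   # assumed threshold, not taken from the paper
    res_thresh: float = 1.0,  # assumed threshold, not taken from the paper
) -> List[int]:
    """Return indices of patches that should be passed to the ViT."""
    if is_keyframe:
        # An intra-coded frame has no temporal reference, so keep every patch.
        return list(range(len(meta)))
    # Keep a patch if the codec reports noticeable motion or residual energy.
    return [
        i for i, (mv, res) in enumerate(meta)
        if mv > mv_thresh or res > res_thresh
    ]


def refresh_plan(num_patches: int, changed: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Split patch positions into (recompute, reuse) lists for the KV cache."""
    changed_set = set(changed)
    recompute = [i for i in range(num_patches) if i in changed_set]
    reuse = [i for i in range(num_patches) if i not in changed_set]
    return recompute, reuse


# Usage: four patches in a predicted (non-key) frame.
meta = [(0.0, 0.1), (2.0, 0.3), (0.1, 4.0), (0.2, 0.2)]
changed = select_patches(meta, is_keyframe=False)
recompute, reuse = refresh_plan(len(meta), changed)
print(changed, recompute, reuse)
```

In this toy setting, only the patches flagged by the codec signal are re-encoded, and the remaining positions keep their cached entries. An actual selective key-value refresh would also have to account for attention dependencies between tokens, which this sketch deliberately ignores.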