Vision-language models (VLMs) have demonstrated impressive multimodal comprehension capabilities and are being deployed in a growing number of online video understanding applications. While recent efforts extensively explore advancing VLMs' reasoning power in these settings, they largely overlook deployment constraints, incurring prohibitive system overhead in real-world deployments. To address this, we propose Venus, an on-device memory-and-retrieval system for efficient online video understanding. Venus adopts an edge-cloud disaggregated architecture that sinks memory construction and keyframe retrieval from the cloud to the edge, and operates in two stages. In the ingestion stage, Venus continuously processes streaming edge videos via scene segmentation and clustering; the selected keyframes are then embedded with a multimodal embedding model to build a hierarchical memory for efficient storage and retrieval. In the querying stage, Venus indexes into the memory with each incoming query and employs a threshold-based progressive sampling algorithm for keyframe selection, which enhances diversity and adaptively balances system cost against reasoning accuracy. Our extensive evaluation shows that Venus achieves a 15x-131x speedup in total response latency over state-of-the-art methods, enabling real-time responses within seconds while maintaining comparable or even superior reasoning accuracy.
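The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of one plausible reading of the querying stage: candidate keyframes are ranked by embedding similarity to the query, then admitted under a pairwise-similarity cap that is progressively relaxed until a frame budget is filled, preferring diverse frames while still bounding cost. All names here (`progressive_sample`, the threshold schedule, the 512-d embeddings) are illustrative assumptions, not Venus's actual implementation.

```python
import numpy as np

def _cosine(a, b):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def progressive_sample(query_emb, frame_embs, budget=8, thresholds=(0.5, 0.7, 0.9)):
    """Hypothetical threshold-based progressive sampling.

    Rank frames by relevance to the query, then admit a frame only if its
    similarity to every already-selected frame stays below the current
    threshold. The threshold is relaxed in stages (strict -> loose), so
    diverse frames win early and the budget is still filled if needed.
    """
    sims = np.array([_cosine(f, query_emb) for f in frame_embs])
    order = np.argsort(-sims)                      # most query-relevant first
    selected = []
    for tau in thresholds:                         # progressively relax diversity cap
        for idx in order:
            if len(selected) >= budget:
                return [int(i) for i in selected]
            if idx in selected:
                continue
            if all(_cosine(frame_embs[idx], frame_embs[j]) < tau for j in selected):
                selected.append(idx)
    return [int(i) for i in selected]

# Toy usage with random stand-ins for multimodal embeddings.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 512))
query = rng.normal(size=512)
print(progressive_sample(query, frames, budget=8))
```

Under this reading, the threshold schedule is the knob that "adaptively balances system cost and reasoning accuracy": a stricter schedule selects fewer, more diverse frames (lower VLM inference cost), while a looser one fills the budget with more query-similar frames.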