Video-language models (VLMs) are reshaping video querying services, bringing unified solutions to complex perception and reasoning tasks. However, deploying large VLMs in real-world systems remains challenging because of their high resource demands, and remote deployment often incurs unacceptable response delays. Small, locally deployable VLMs respond faster but fall short in accuracy. To reconcile this trade-off, we propose QuickGrasp, a responsive, quality-of-service (QoS)-aware system built on a local-first architecture with on-demand edge augmentation. Exploiting the highly modular architecture of VLMs, QuickGrasp shares the vision representation across model variants to avoid redundant computation. To maximize system-wide efficiency, QuickGrasp introduces three key designs: accelerated video tokenization, query-adaptive edge augmentation, and delay-aware, accuracy-preserving configuration of vision token density. We implement a prototype of QuickGrasp and evaluate it on multiple video understanding benchmarks. The results show that QuickGrasp matches the accuracy of large VLMs while reducing response delay by up to 12.8x. QuickGrasp is a key step toward responsive video querying services for open-world understanding that fully leverage the capabilities of VLMs.