Zelda: Video Analytics using Vision-Language Models

Advances in ML have motivated the design of video analytics systems that allow for structured queries over video datasets. However, existing systems limit query expressivity, require users to specify an ML model per predicate, rely on complex optimizations that trade off accuracy for performance, and return large amounts of redundant and low-quality results. This paper focuses on recently developed Vision-Language Models (VLMs), which allow users to query images using natural language such as "cars during daytime at traffic intersections." Through an in-depth analysis, we show that VLMs address three limitations of current video analytics systems: they offer general expressivity, they provide a single general-purpose model that can answer many predicates, and they are both simple and fast. However, VLMs still return large numbers of redundant and low-quality results, which can overwhelm and burden users. We present Zelda, a video analytics system that uses VLMs to return both relevant and semantically diverse results for top-K queries on large video datasets. Zelda prompts the VLM with the user's natural-language query plus additional terms that improve accuracy and identify low-quality frames. Zelda improves result diversity by leveraging the rich semantic information encoded in VLM embeddings. We evaluate Zelda across five datasets and 19 queries and show quantitatively that it achieves higher mean average precision (up to 1.15$\times$) and improves average pairwise similarity (up to 1.16$\times$) compared to using VLMs out-of-the-box. We also compare Zelda to a state-of-the-art video analytics engine and show that Zelda retrieves results 7.5$\times$ (up to 10.4$\times$) faster for the same accuracy and frame diversity.
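To make the idea of relevant yet semantically diverse top-K retrieval over embeddings concrete, the following is a minimal sketch, not Zelda's actual algorithm. It assumes frame and query embeddings have already been produced by some VLM and are given as plain lists of floats. The greedy selection rule is a maximal-marginal-relevance-style heuristic substituted here for illustration, and the names `diverse_top_k`, `trade_off`, and `average_pairwise_similarity` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def diverse_top_k(query_emb, frame_embs, k, trade_off=0.5):
    """Greedily pick k frame indices, balancing relevance and diversity.

    Assumption: an MMR-style rule, used only for illustration.
    trade_off=1.0 reduces to plain top-k by similarity to the query.
    """
    relevance = [cosine(query_emb, f) for f in frame_embs]
    selected = []
    candidates = set(range(len(frame_embs)))

    def score(i):
        # Redundancy: similarity to the closest frame already selected.
        redundancy = max(
            (cosine(frame_embs[i], frame_embs[j]) for j in selected),
            default=0.0,
        )
        return trade_off * relevance[i] - (1.0 - trade_off) * redundancy

    while candidates and len(selected) < k:
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


def average_pairwise_similarity(frame_embs, indices):
    """Mean cosine similarity over all pairs of selected frames."""
    pairs = [
        (i, j) for pos, i in enumerate(indices) for j in indices[pos + 1:]
    ]
    if not pairs:
        return 0.0
    return sum(cosine(frame_embs[i], frame_embs[j]) for i, j in pairs) / len(pairs)
```

With `trade_off=1.0` the function returns the frames most similar to the query, which may be near-duplicates of one another. Lowering `trade_off` penalizes candidates that resemble frames already chosen, which lowers the average pairwise similarity of the returned set.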
