Zelda: Video Analytics using Vision-Language Models

Advances in ML have motivated the design of video analytics systems that allow for structured queries over video datasets. However, existing systems limit query expressivity, require users to specify an ML model per predicate, rely on complex optimizations that trade off accuracy for performance, and return large amounts of redundant and low-quality results. This paper focuses on recently developed Vision-Language Models (VLMs), which allow users to query images using natural language such as "cars during daytime at traffic intersections." Through an in-depth analysis, we show that VLMs address three limitations of current video analytics systems: they offer general expressivity, use a single general-purpose model to query many predicates, and are both simple and fast. However, VLMs still return large numbers of redundant and low-quality results that can overwhelm and burden users. In addition, VLMs often require manual prompt engineering to improve result relevance. We present Zelda: a video analytics system that uses VLMs to return both relevant and semantically diverse results for top-K queries on large video datasets. Zelda prompts the VLM with the user's query in natural language. Zelda then automatically adds discriminator and synonym terms to boost accuracy, as well as terms to identify low-quality frames. To improve result diversity, Zelda uses semantically rich VLM embeddings in an algorithm that prunes similar frames while considering their relevance to the query and the number of top-K results requested. We evaluate Zelda across five datasets and 19 queries and quantitatively show that it achieves higher mean average precision (up to 1.15x) and improves average pairwise similarity (up to 1.16x) compared to using VLMs out-of-the-box. We also compare Zelda to a state-of-the-art video analytics engine and show that Zelda retrieves results 7.5x (up to 10.4x) faster for the same accuracy and frame diversity.
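To make the diversity step concrete, the following is a minimal sketch of one way to prune similar frames while respecting relevance and the requested K. It is not Zelda's actual algorithm, which the abstract does not specify; it is a simple greedy filter chosen for illustration. The function names, the `max_similarity` threshold, and the toy 2-D embeddings are all hypothetical, and the embeddings are assumed to have been produced beforehand by some VLM.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def diverse_top_k(frame_embeddings, query_embedding, k, max_similarity=0.9):
    """Greedy, diversity-aware top-K selection (illustrative only).

    Frames are visited from most to least relevant to the query. A frame is
    kept only if its similarity to every already-kept frame is at most
    max_similarity (a hypothetical threshold, not a value from the paper).
    Returns the indices of the kept frames, most relevant first.
    """
    relevance = [cosine(e, query_embedding) for e in frame_embeddings]
    ranked = sorted(range(len(frame_embeddings)),
                    key=lambda i: relevance[i], reverse=True)
    selected = []
    for i in ranked:
        if len(selected) == k:
            break
        if all(cosine(frame_embeddings[i], frame_embeddings[j]) <= max_similarity
               for j in selected):
            selected.append(i)
    return selected


def average_pairwise_similarity(frame_embeddings, indices):
    """Mean cosine similarity over all pairs of selected frames.

    As computed here, lower values indicate a more diverse result set.
    """
    pairs = [(a, b) for n, a in enumerate(indices) for b in indices[n + 1:]]
    if not pairs:
        return 0.0
    total = sum(cosine(frame_embeddings[a], frame_embeddings[b]) for a, b in pairs)
    return total / len(pairs)


# Toy data: frames 0 and 1 are near-duplicates; both are relevant to the query.
frames = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8], [0.0, 1.0]]
query = [1.0, 0.2]

print(diverse_top_k(frames, query, k=2))                      # [1, 2]
print(diverse_top_k(frames, query, k=2, max_similarity=1.0))  # [1, 0]
```

With the threshold effectively disabled (`max_similarity=1.0`), the selection degenerates to plain relevance ranking and returns the two near-duplicate frames. With the default threshold, the second duplicate is skipped in favor of a less relevant but visually distinct frame, which lowers the average pairwise similarity of the result set.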
