The unprecedented surge in video data production in recent years necessitates efficient tools for extracting meaningful frames from videos for downstream tasks. Long-term temporal reasoning is a key desideratum for frame retrieval systems. While state-of-the-art foundation models such as VideoLLaMA and ViCLIP are proficient in short-term semantic understanding, they surprisingly fail at long-term reasoning across frames. A key reason for this failure is that they intertwine per-frame perception and temporal reasoning in a single deep network. Hence, decoupling but co-designing semantic understanding and temporal reasoning is essential for efficient scene identification. We propose a system that leverages vision-language models for semantic understanding of individual frames but effectively reasons about the long-term evolution of events using state machines and temporal logic (TL) formulae that inherently capture memory. Our TL-based reasoning improves the F1 score of complex event identification by 9-15% compared to benchmarks that use GPT-4 for reasoning on state-of-the-art self-driving datasets such as Waymo and NuScenes.
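The decoupling described above can be illustrated with a minimal sketch (our own illustration, not the paper's implementation): a vision-language model emits boolean predicates per frame, and a small state machine monitors a temporal-logic event such as "eventually a pedestrian appears, and the car subsequently stops." The state variable is the "memory" that a purely per-frame model lacks; all predicate names and the event itself are hypothetical examples.

```python
def monitor(frames):
    """Return indices of frames at which the event completes.

    `frames` is a list of dicts of boolean predicates, as might be
    produced by a vision-language model run on each frame
    (predicate names here are illustrative, not from the paper).
    Event monitored: a pedestrian appears, then the car stops.
    """
    state = "WAIT_PEDESTRIAN"  # the state machine carries long-term memory
    matches = []
    for i, preds in enumerate(frames):
        if state == "WAIT_PEDESTRIAN" and preds.get("pedestrian", False):
            state = "WAIT_STOP"
        if state == "WAIT_STOP" and preds.get("car_stopped", False):
            matches.append(i)          # event satisfied at this frame
            state = "WAIT_PEDESTRIAN"  # reset to catch repeated occurrences
    return matches

frames = [
    {"pedestrian": False, "car_stopped": False},
    {"pedestrian": True,  "car_stopped": False},
    {"pedestrian": False, "car_stopped": True},
    {"pedestrian": False, "car_stopped": False},
]
print(monitor(frames))  # [2]
```

Because the monitor only consumes per-frame predicates, the perception model can be swapped or improved independently of the temporal reasoning, which is the co-design argument the abstract makes.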