Vision-Language Models (VLMs) can generate high-level, interpretable descriptions of complex activities from images and videos, making them valuable for situational awareness (SA) applications. In such settings, the focus is on identifying infrequent but significant events with high reliability and accuracy, while also extracting fine-grained details and assessing recognition quality. In this paper, we propose an approach that integrates VLMs with traditional computer vision methods through explicit logic reasoning to enhance SA in three key ways: (a) extracting fine-grained event details, (b) employing an intelligent fine-tuning (FT) strategy that achieves substantially higher accuracy than uninformed selection, and (c) generating justifications for VLM outputs during inference. We demonstrate that our intelligent FT mechanism improves accuracy and, during inference, provides a valuable means to either confirm the validity of the VLM output or indicate why it may be questionable.