This paper explores the potential of Large Language Models (LLMs) in zero-shot anomaly detection for safe visual navigation. With the assistance of the state-of-the-art real-time open-world object detection model Yolo-World and specialized prompts, the proposed framework identifies anomalies within camera-captured frames, including any possible obstacles, and then generates concise, audio-delivered descriptions that emphasize abnormalities, assisting safe visual navigation in complex circumstances. Moreover, the framework leverages the strengths of LLMs and the open-vocabulary object detection model to achieve dynamic scenario switching, which allows users to transition smoothly from scene to scene and addresses the limitation of traditional visual navigation. Furthermore, this paper examines the performance contribution of different prompt components, offers a vision for future improvements in visual accessibility, and paves the way for LLMs in video anomaly detection and vision-language understanding.