Traditional approaches to safety event analysis in autonomous systems have relied on complex machine learning models and extensive datasets to achieve high accuracy and reliability. The advent of Multimodal Large Language Models (MLLMs), however, offers a novel approach: by integrating textual, visual, and audio modalities, they enable automated analysis of driving videos. Our framework leverages the reasoning power of MLLMs, directing their output through context-specific prompts to ensure accurate, reliable, and actionable insights for hazard detection. By incorporating models such as Gemini-Pro-Vision 1.5 and Llava, our methodology aims to automate the detection of safety-critical events and to mitigate common issues such as hallucinations in MLLM outputs. Preliminary results demonstrate the framework's potential for zero-shot learning and accurate scenario analysis, though further validation on larger datasets is necessary. Further investigation is also required to explore the performance gains attainable through few-shot learning and fine-tuned models. This research underscores the significance of MLLMs in advancing the analysis of naturalistic driving videos by improving safety-critical event detection and the understanding of interactions with complex environments.