Traditional approaches to safety event analysis in autonomous systems have relied on complex machine learning models and extensive datasets to achieve high accuracy and reliability. The advent of Multimodal Large Language Models (MLLMs), however, offers a novel approach: by integrating textual, visual, and audio modalities, they enable automated analysis of driving videos. Our framework leverages the reasoning power of MLLMs, directing their output through context-specific prompts to ensure accurate, reliable, and actionable insights for hazard detection. By incorporating models such as Gemini-Pro-Vision 1.5 and Llava, our methodology aims to automate the detection of safety-critical events and to mitigate common issues such as hallucinations in MLLM outputs. Preliminary results demonstrate the framework's potential for zero-shot learning and accurate scenario analysis, though further validation on larger datasets is necessary. Further investigation is also required to explore the performance gains attainable through few-shot learning and fine-tuned models. This research underscores the significance of MLLMs in advancing the analysis of naturalistic driving videos by improving safety-critical event detection and the understanding of interactions with complex environments.