This study comprehensively reviews and empirically evaluates the application of multimodal large language models (MLLMs) and Large Vision Models (VLMs) to object detection in transportation systems. First, we provide background on the potential benefits of MLLMs in transportation applications and review the MLLM technologies used in previous studies, highlighting their effectiveness and limitations for object detection across various transportation scenarios. Second, we give an overview of the taxonomy of end-to-end object detection in transportation applications and of future directions. Building on this, we propose an empirical analysis that tests MLLMs on three real-world transportation problems involving object detection tasks: road safety attribute extraction, safety-critical event detection, and visual reasoning over thermal images. Our findings provide a detailed assessment of MLLM performance, uncovering both strengths and areas for improvement. Finally, we discuss the practical limitations and challenges of using MLLMs to enhance object detection in transportation, thereby offering a roadmap for future research and development in this critical area.