The integration of thermal imaging data with Multimodal Large Language Models (MLLMs) presents a promising opportunity for improving the safety and functionality of autonomous driving systems and many Intelligent Transportation Systems (ITS) applications. This study investigates whether MLLMs can understand complex images from RGB and thermal cameras and detect objects directly. Our goals were to 1) assess the ability of MLLMs to learn from information across different datasets, 2) detect objects and identify elements in thermal-camera images, 3) determine whether two images from independent modalities show the same scene, and 4) learn all objects using different modalities. The findings showed that both GPT-4 and Gemini were effective in detecting and classifying objects in thermal images. Specifically, the Mean Absolute Percentage Error (MAPE) for pedestrian classification was 70.39% for GPT-4 and 81.48% for Gemini. Moreover, for bike, car, and motorcycle detection, GPT-4 produced MAPEs of 78.4%, 55.81%, and 96.15%, respectively, while Gemini produced MAPEs of 66.53%, 59.35%, and 78.18%, respectively. These findings further demonstrate that MLLMs can interpret thermal images and can be employed in advanced imaging automation technologies for ITS applications.
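For readers unfamiliar with the evaluation metric, the MAPE values above compare model-reported object counts against ground truth. A minimal sketch of the computation follows; the count values are illustrative placeholders, not data from the study.

```python
# Minimal sketch of the Mean Absolute Percentage Error (MAPE) metric
# used to score per-image object counts. The counts below are
# hypothetical examples, not results from the study.

def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent.

    Assumes equal-length sequences and nonzero ground-truth counts.
    """
    assert len(actual) == len(predicted) and all(a != 0 for a in actual)
    return 100.0 * sum(abs(a - p) / abs(a)
                       for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical per-image pedestrian counts: ground truth vs. model output.
truth = [4, 2, 5, 3]
pred = [3, 2, 6, 1]
print(round(mape(truth, pred), 2))  # → 27.92
```

Lower MAPE indicates better agreement between the model's counts and the ground truth, so the per-class values reported in the abstract can be read directly as relative detection error.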