Recognizing and understanding traffic incidents, particularly traffic accidents, is a central problem in intelligent transportation systems and intelligent vehicles, and it has drawn sustained attention from both academia and industry. Identifying and comprehending complex traffic events is highly challenging, primarily because of the intricate nature of traffic environments, the diversity of observational perspectives, and the multifaceted causes of accidents. These factors have persistently impeded the development of effective solutions. The advent of large vision-language models (VLMs) such as GPT-4V has introduced new approaches to this problem. In this paper, we probe GPT-4V with a set of representative traffic incident videos and examine the model's capacity to understand these complex traffic situations. We observe that GPT-4V demonstrates remarkable cognitive, reasoning, and decision-making ability in certain classic traffic events. We also identify limitations of GPT-4V that constrain its understanding of more intricate scenarios and that merit further exploration and resolution.