Recognizing and understanding traffic incidents, particularly traffic accidents, is of paramount importance for intelligent transportation systems and intelligent vehicles, and has long attracted extensive attention from both academia and industry. Identifying and comprehending complex traffic events is highly challenging, primarily because of the intricate nature of traffic environments, the diversity of observational perspectives, and the multifaceted causes of accidents. These factors have persistently impeded the development of effective solutions. The advent of large vision-language models (VLMs), such as GPT-4V, has introduced new approaches to this problem. In this paper, we evaluate GPT-4V on a set of representative traffic incident videos and examine its capacity to understand these complex traffic situations. We observe that GPT-4V demonstrates remarkable cognitive, reasoning, and decision-making ability in certain classic traffic events. We also identify limitations of GPT-4V that constrain its understanding of more intricate scenarios and that merit further exploration and resolution.