Road traffic crashes cause millions of deaths annually and have a significant economic impact, particularly in low- and middle-income countries (LMICs). This paper presents an approach that uses Vision Language Models (VLMs) for road safety assessment, overcoming the limitations of traditional Convolutional Neural Networks (CNNs). We introduce a new task, V-RoAst (Visual question answering for Road Assessment), together with a real-world dataset. Our approach optimizes prompt engineering and evaluates advanced VLMs, including Gemini-1.5-flash and GPT-4o-mini. The models effectively examine the attributes used for road assessment. Using crowdsourced imagery from Mapillary, our scalable solution efficiently estimates road safety levels. In addition, because it does not require training data, the approach is suited to local stakeholders who lack resources. It offers a cost-effective and automated method for global road safety assessment, potentially saving lives and reducing economic burdens.