In the vast and dynamic landscape of urban settings, Traffic Safety Description and Analysis plays a pivotal role in applications ranging from insurance inspection to accident prevention. This paper introduces CityLLaVA, a novel fine-tuning framework for Visual Language Models (VLMs) designed for urban scenarios. CityLLaVA enhances model comprehension and prediction accuracy through (1) employing bounding boxes for optimal visual data preprocessing, including video best-view selection and visual prompt engineering, during both the training and testing phases; (2) constructing concise question-answer sequences and designing textual prompts to refine instruction comprehension; (3) implementing block expansion to fine-tune large VLMs efficiently; and (4) improving prediction accuracy via a unique sequential questioning-based prediction augmentation. Our method achieved a benchmark score of 33.4308, securing the leading position on the leaderboard. The code is available at https://github.com/alibaba/AICITY2024_Track2_AliOpenTrek_CityLLaVA
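To make point (3) concrete, block expansion (in the style of LLaMA Pro) interleaves identity-initialized copies of pretrained transformer blocks and trains only those copies, leaving the original weights frozen. The following is a minimal sketch under assumed names: a Hugging Face-style LLaMA decoder whose blocks live in `model.layers`, and an illustrative helper `expand_blocks` with an `interval` parameter; neither is taken from the CityLLaVA codebase.

```python
# Sketch of block expansion for a LLaMA-style decoder (assumption: blocks are
# stored in `model.layers`, each with `self_attn.o_proj` and `mlp.down_proj`).
import copy
import torch.nn as nn

def expand_blocks(model: nn.Module, interval: int = 4) -> nn.Module:
    """After every `interval` pretrained blocks, insert an identity-initialized
    copy of the preceding block; only the inserted copies remain trainable."""
    new_layers = []
    for i, layer in enumerate(model.layers):
        layer.requires_grad_(False)           # freeze the pretrained block
        new_layers.append(layer)
        if (i + 1) % interval == 0:
            block = copy.deepcopy(layer)      # duplicate the block
            # Zero the output projections so the new block contributes nothing
            # at initialization; the residual connections make it an identity.
            nn.init.zeros_(block.self_attn.o_proj.weight)
            nn.init.zeros_(block.mlp.down_proj.weight)
            block.requires_grad_(True)        # train only the expanded blocks
            new_layers.append(block)
    model.layers = nn.ModuleList(new_layers)
    return model
```

The zero-initialized output projections preserve the pretrained model's behavior exactly at the start of fine-tuning, so the expanded blocks adapt to the new domain without disturbing the frozen base weights.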