PaveBench: A Versatile Benchmark for Pavement Distress Perception and Interactive Vision-Language Analysis

Pavement condition assessment is essential for road safety and maintenance. Existing research has made significant progress, but most studies focus on conventional computer vision tasks such as classification, detection, and segmentation. Real-world pavement inspection requires more than visual recognition: it also calls for quantitative analysis, explanation, and interactive decision support. Current datasets focus on unimodal perception, lack support for multi-turn interaction and fact-grounded reasoning, and do not connect perception with vision-language analysis. To address these limitations, we introduce PaveBench, a large-scale benchmark for pavement distress perception and interactive vision-language analysis built on real-world highway inspection images. PaveBench supports four core tasks (classification, object detection, semantic segmentation, and vision-language question answering) with unified task definitions and evaluation protocols. On the visual side, it provides large-scale annotations over a large collection of real-world pavement images and includes a curated hard-distractor subset for robustness evaluation. On the multimodal side, we introduce PaveVQA, a real-image question answering (QA) dataset that supports single-turn, multi-turn, and expert-corrected interactions and covers recognition, localization, quantitative estimation, and maintenance reasoning. We evaluate several state-of-the-art methods and provide a detailed analysis. We also present a simple and effective agent-augmented visual question answering framework that integrates domain-specific models as tools alongside vision-language models. The dataset is available at: https://huggingface.co/datasets/MML-Group/PaveBench.
