Perceiving and autonomously navigating through work zones is a challenging and underexplored problem. Open datasets for this long-tailed scenario are scarce. We propose the ROADWork dataset to learn to recognize, observe, analyze, and drive through work zones. State-of-the-art foundation models fail when applied to work zones. Fine-tuning models on our dataset significantly improves perception and navigation in work zones. With the ROADWork dataset, we discover new work zone images with higher precision (+32.5%) at a much higher rate (12.8$\times$) around the world. Open-vocabulary methods fail too, whereas fine-tuned detectors improve performance (+32.2 AP). Vision-Language Models (VLMs) struggle to describe work zones, but fine-tuning substantially improves performance (+36.7 SPICE). Beyond fine-tuning, we show the value of simple techniques. Video label propagation provides additional gains (+2.6 AP) for instance segmentation. For reading work zone signs, composing a detector and a text spotter via crop-scaling improves performance (+14.2% 1-NED). Composing work zone detections to provide context further reduces hallucinations (+3.9 SPICE) in VLMs. We predict navigational goals and compute drivable paths from work zone videos. Incorporating road work semantics ensures 53.6% of goals have angular error (AE) < 0.5 (+9.9%) and 75.3% of pathways have AE < 0.5 (+8.1%).
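As a rough illustration of the crop-scaling composition mentioned above, the sketch below crops each detected sign, enlarges the crop, and hands it to a text spotter. This is a minimal sketch, not the paper's implementation: `detect_signs`, `spot_text`, the padding, and the scale factor are hypothetical placeholders, and OpenCV is assumed only for resizing.

```python
# Hypothetical sketch: compose a sign detector with a text spotter via crop-scaling.
# `detect_signs` and `spot_text` are placeholder callables, not APIs from the paper.
from typing import Callable, List, Tuple

import cv2
import numpy as np

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates


def read_signs(
    image: np.ndarray,
    detect_signs: Callable[[np.ndarray], List[Box]],
    spot_text: Callable[[np.ndarray], str],
    scale: float = 4.0,   # assumed upscaling factor
    pad: int = 8,         # assumed padding around each detected box
) -> List[str]:
    """Crop each detected sign, enlarge the crop, then run the text spotter on it."""
    h, w = image.shape[:2]
    texts = []
    for x1, y1, x2, y2 in detect_signs(image):
        # Pad the box slightly so sign borders are not clipped.
        x1, y1 = max(0, x1 - pad), max(0, y1 - pad)
        x2, y2 = min(w, x2 + pad), min(h, y2 + pad)
        crop = image[y1:y2, x1:x2]
        # Upscale the crop so small sign text is large enough for the spotter.
        new_size = (int(crop.shape[1] * scale), int(crop.shape[0] * scale))
        crop = cv2.resize(crop, new_size, interpolation=cv2.INTER_CUBIC)
        texts.append(spot_text(crop))
    return texts
```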