Culverts on canals such as the Erie Canal, originally built in 1825, require frequent inspection to ensure safe operation. Human inspection of culverts is challenging due to their age, confined geometry, poor illumination, weather, and lack of easy access. We introduce VISION, an end-to-end, language-in-the-loop autonomy system that couples a web-scale vision-language model (VLM) with constrained viewpoint planning for autonomous culvert inspection. Brief prompts to the VLM solicit open-vocabulary region-of-interest (ROI) proposals with rationales and confidences, stereo depth is fused to recover metric scale, and a planner, aware of culvert constraints, commands repositioning moves to capture targeted close-ups. Deployed on a quadruped in a culvert under the Erie Canal, VISION closes the see-decide-move-re-image loop on board and produces high-resolution images for detailed reporting without domain-specific fine-tuning. In an external evaluation by New York Canal Corporation personnel, initial ROI proposals achieved 61.4\% agreement with subject-matter experts, and final post-re-imaging assessments reached 80\%, indicating that VISION converts tentative hypotheses into grounded, expert-aligned findings.