Traditional approaches to off-road autonomy rely on separate models for terrain classification, height estimation, and slip or slope quantification. Using several models means training each component separately, collecting task-specific datasets, and fine-tuning. In this work, we present a zero-shot approach that uses SAM2 for environment segmentation and a vision-language model (VLM) to reason about drivable areas. We pass the VLM both the original image and the segmented image, in which each mask is annotated with a numeric label. The VLM is then prompted to identify which regions, referenced by these numeric labels, are drivable. Combined with planning and control modules, this unified framework eliminates the need for explicit terrain-specific models and relies instead on the inherent reasoning capabilities of the VLM. Our approach surpasses state-of-the-art trainable models on high-resolution segmentation datasets and enables full-stack navigation in our Isaac Sim off-road environment.
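The label-based exchange with the VLM described above can be illustrated with a minimal sketch. It is not the authors' implementation: the segmenter and the VLM are replaced by toy stand-ins, and the prompt wording, the reply format, and the helper names `build_prompt`, `parse_drivable_labels`, and `drivable_mask` are all assumptions made for illustration. The sketch only shows how numeric mask labels could be turned into a prompt, how a reply could be parsed, and how the selected masks could be merged into one drivable-area mask for a planner.

```python
import re


def build_prompt(label_ids):
    """Hypothetical prompt wording; the actual prompt is not given in the text."""
    ids = ", ".join(str(i) for i in label_ids)
    return (
        "You are given an off-road camera image and the same image segmented "
        f"into regions labeled {ids}. List the labels of the regions a ground "
        "vehicle can drive on, as comma-separated integers."
    )


def parse_drivable_labels(reply, label_ids):
    """Extract integers from the reply, keeping only labels that exist."""
    valid = set(label_ids)
    found = {int(tok) for tok in re.findall(r"\d+", reply)}
    return sorted(found & valid)


def drivable_mask(masks, drivable_labels, height, width):
    """Union the binary masks whose labels were marked drivable."""
    out = [[False] * width for _ in range(height)]
    for label in drivable_labels:
        for r in range(height):
            for c in range(width):
                if masks[label][r][c]:
                    out[r][c] = True
    return out


def rect(height, width, r0, r1, c0, c1):
    """Toy binary mask: True inside rows [r0, r1) and columns [c0, c1)."""
    return [[r0 <= r < r1 and c0 <= c < c1 for c in range(width)]
            for r in range(height)]


# Three 4x4 masks standing in for segmenter output (assumed, not real SAM2 data).
H, W = 4, 4
masks = {
    1: rect(H, W, 2, 4, 0, 4),  # bottom half of the image
    2: rect(H, W, 0, 2, 0, 2),  # top-left block
    3: rect(H, W, 0, 2, 2, 4),  # top-right block
}

prompt = build_prompt(masks.keys())
reply = "Drivable regions: 1, 3"  # stand-in for a VLM reply
labels = parse_drivable_labels(reply, masks.keys())
mask = drivable_mask(masks, labels, H, W)
drivable_pixels = sum(sum(row) for row in mask)
```

Filtering the parsed integers against the known labels guards against a reply that names a region that does not exist. A free-form reply may still contain unrelated numbers, so a stricter output format (for example a JSON list) would likely be more robust in practice.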