Explainability and transparent decision-making are essential for the safe deployment of autonomous driving systems. Scene captioning summarizes environmental conditions and risk factors in natural language, improving transparency, safety, and human--robot interaction. However, most existing approaches target structured urban scenarios; in off-road environments, they are vulnerable to single-modality degradations caused by rain, fog, snow, and darkness, and they lack a unified framework that jointly models structured scene captioning and path planning. To bridge this gap, we propose Wild-Drive, an efficient framework for off-road scene captioning and path planning. Wild-Drive adopts modern multimodal encoders and introduces a task-conditioned modality-routing bridge, MoRo-Former, to adaptively aggregate reliable information under degraded sensing. It then integrates an efficient large language model (LLM), together with a planning token and a gated recurrent unit (GRU) decoder, to generate structured captions and predict future trajectories. We also build the OR-C2P benchmark, which covers structured off-road scene captioning and path planning under diverse sensor corruption conditions. Experiments on the OR-C2P benchmark and a self-collected dataset show that Wild-Drive outperforms prior LLM-based methods and remains more stable under degraded sensing. The code and benchmark will be publicly available at https://github.com/wangzihanggg/Wild-Drive.