End-to-end autonomous driving models increasingly benefit from large vision--language models for semantic understanding, yet ensuring safe and accurate operation under long-tail conditions remains challenging. These challenges are particularly prominent in long-tail mixed-traffic scenarios, where autonomous vehicles must interact with heterogeneous road users, including human-driven vehicles and vulnerable road users, under complex and uncertain conditions. This paper proposes HERMES, a holistic risk-aware end-to-end multimodal driving framework designed to inject explicit long-tail risk cues into trajectory planning. HERMES employs a foundation-model-assisted annotation pipeline to produce structured Long-Tail Scene Context and Long-Tail Planning Context, capturing hazard-centric cues together with maneuver intent and safety preference, and uses these signals to guide end-to-end planning. HERMES further introduces a Tri-Modal Driving Module that fuses multi-view perception, historical motion cues, and semantic guidance, enabling accurate, risk-aware trajectory planning in long-tail scenarios. Experiments on a real-world long-tail dataset demonstrate that HERMES consistently outperforms representative end-to-end and VLM-driven baselines under long-tail mixed-traffic scenarios. Ablation studies verify the complementary contributions of its key components.