In recent years, reinforcement learning (RL)-based methods for learning driving policies have gained increasing attention in the autonomous driving community and have achieved remarkable progress in various driving scenarios. However, traditional RL approaches rely on manually engineered rewards, which require extensive human effort and often lack generalizability. To address these limitations, we propose \textbf{VLM-RL}, a unified framework that integrates pre-trained Vision-Language Models (VLMs) with RL to generate reward signals from image observations and natural language goals. The core of VLM-RL is the contrasting language goal (CLG)-as-reward paradigm, which uses positive and negative language goals to generate semantic rewards. We further introduce a hierarchical reward synthesis approach that combines CLG-based semantic rewards with vehicle state information, improving reward stability and offering a more comprehensive reward signal. Additionally, a batch-processing technique is employed to optimize computational efficiency during training. Extensive experiments in the CARLA simulator demonstrate that VLM-RL outperforms state-of-the-art baselines, achieving a 10.5\% reduction in collision rate, a 104.6\% increase in route completion rate, and robust generalization to unseen driving scenarios. Furthermore, VLM-RL integrates seamlessly with almost any standard RL algorithm, potentially revolutionizing the existing RL paradigm that relies on manual reward engineering and enabling continuous performance improvements. The demo video and code can be accessed at: https://zilin-huang.github.io/VLM-RL-website.
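To make the CLG-as-reward idea concrete, the following is a minimal sketch of one plausible formulation: the semantic reward is the VLM similarity between the current observation and the positive goal minus the similarity to the negative goal, rescaled to $[0, 1]$. The exact reward formula, the goal phrasings, and the stand-in embeddings here are illustrative assumptions, not the paper's implementation; in practice the embeddings would come from a pre-trained VLM such as CLIP.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def clg_reward(obs_emb: np.ndarray,
               pos_emb: np.ndarray,
               neg_emb: np.ndarray) -> float:
    """Contrasting-language-goal semantic reward (illustrative form):
    similarity to the positive goal minus similarity to the negative
    goal, shifted and scaled into [0, 1]."""
    return 0.5 * (1.0 + cosine(obs_emb, pos_emb) - cosine(obs_emb, neg_emb))

# Toy vectors standing in for VLM embeddings of the camera image and
# the two language goals (hypothetical phrasings).
obs = np.array([1.0, 0.2, 0.0])   # embedding of the current image
pos = np.array([1.0, 0.0, 0.0])   # e.g. "the road ahead is clear"
neg = np.array([0.0, 1.0, 0.0])   # e.g. "the vehicle is colliding"

r = clg_reward(obs, pos, neg)     # high when obs aligns with the positive goal
```

In this sketch the observation embedding is close to the positive goal and nearly orthogonal to the negative one, so the reward lands well above 0.5; the hierarchical synthesis described in the abstract would then combine this semantic term with vehicle-state signals.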