Model-based Reinforcement Learning (MBRL) has emerged as a promising paradigm for autonomous driving, where data efficiency and robustness are critical. Yet existing solutions often rely on carefully crafted, task-specific extrinsic rewards, limiting generalization to new tasks or environments. In this paper, we propose InDRiVE (Intrinsic Disagreement-based Reinforcement for Vehicle Exploration), a method that leverages purely intrinsic, disagreement-based rewards within a Dreamer-based MBRL framework. By training an ensemble of world models, the agent actively explores high-uncertainty regions of the environment without any task-specific feedback. This approach yields a task-agnostic latent representation, enabling rapid zero-shot or few-shot fine-tuning on downstream driving tasks such as lane following and collision avoidance. Experimental results in both seen and unseen environments demonstrate that InDRiVE achieves higher success rates and fewer infractions than the DreamerV2 and DreamerV3 baselines despite using significantly fewer training steps. Our findings highlight the effectiveness of purely intrinsic exploration for learning robust vehicle-control behaviors, paving the way for more scalable and adaptable autonomous driving systems.