End-to-end autonomous driving has achieved state-of-the-art performance on benchmarks and in real-world deployments. Its standard training recipe, however, is expensive at every stage: collecting and labeling millions of driving frames is costly, and closed-loop RL on images is bottlenecked by the per-step cost of photorealistic rendering plus a forward pass through a large vision backbone. Self-play in vectorized simulators changes the economics: it delivers millions of rollout steps per second and a state distribution naturally rich in collisions, near-misses, and recoveries that no driving log contains. Our approach exploits this asymmetry by decoupling learning to drive from learning to see. We pretrain a single policy by self-play, then align its latent space with a pretrained vision backbone using two losses: an action KL divergence and a batch-relational low-rank structural loss. Because the action target comes from the self-play policy, alignment never supervises against a logged trajectory: a paired dataset of (image, scene-state) frames suffices, with no need for the curated expert demonstrations that imitation pretraining is built on. In photorealistic 3D Gaussian splatting closed-loop scenarios, the resulting end-to-end policy matches or exceeds prior end-to-end methods.
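The alignment objective described above combines two terms, and a rough sketch may help make them concrete. The abstract does not define either loss precisely, so everything below is an assumption made for illustration: a discrete action space, KL taken from the self-play (teacher) policy to the vision (student) policy, a "batch-relational" term that compares pairwise cosine-similarity matrices across the batch, and a "low-rank" target obtained by truncated SVD of the teacher's similarity matrix. The function names, the `rank` and `weight` parameters, and the tensor shapes are all hypothetical and not taken from the paper.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def action_kl(teacher_logits, student_logits):
    """Mean KL(teacher || student) over the batch (discrete actions assumed)."""
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    return float((np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1).mean())


def batch_similarity(z):
    """B x B cosine-similarity matrix between all pairs of latents in a batch."""
    z = z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-8)
    return z @ z.T


def low_rank(matrix, rank):
    """Best rank-`rank` approximation of `matrix` via truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


def structural_loss(teacher_z, student_z, rank=4):
    """MSE between the student's batch similarities and a low-rank
    approximation of the teacher's (one assumed reading of the loss)."""
    target = low_rank(batch_similarity(teacher_z), rank)
    return float(np.mean((batch_similarity(student_z) - target) ** 2))


def alignment_loss(teacher_logits, student_logits, teacher_z, student_z,
                   weight=1.0, rank=4):
    """Action KL plus weighted structural term; `weight` is a hypothetical knob."""
    return (action_kl(teacher_logits, student_logits)
            + weight * structural_loss(teacher_z, student_z, rank))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, n_actions = 16, 5
    teacher_logits = rng.normal(size=(batch, n_actions))  # from scene state
    student_logits = rng.normal(size=(batch, n_actions))  # from images
    teacher_z = rng.normal(size=(batch, 32))   # self-play policy latents
    student_z = rng.normal(size=(batch, 128))  # vision backbone latents
    print(alignment_loss(teacher_logits, student_logits, teacher_z, student_z))
```

Because the structural term only compares B x B similarity matrices, the two latent spaces in this sketch need not share a dimension (32 versus 128 in the demo), and no logged expert action appears anywhere: both targets come from the teacher policy evaluated on the paired scene state.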