See Less, Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection

Recent advances in end-to-end autonomous driving show that policies trained on patch-aligned features extracted from foundation models generalize better to out-of-distribution (OOD) scenarios. We hypothesize that, because of the self-attention mechanism, each patch feature implicitly embeds information from all other patches, represented in a different way and with a different intensity, which makes these descriptors highly redundant. We quantify the redundancy of such (BLIP2) features via PCA and cross-patch similarity: $90$% of the variance is captured by $17$ of $64$ principal components, and strong inter-token correlations are pervasive. Training on such overlapping information leads the policy to overfit to spurious correlations, hurting OOD robustness. We present Stochastic-Patch-Selection (SPS), a simple yet effective approach for learning policies that are more robust, generalizable, and efficient. For every frame, SPS randomly masks a fraction of the patch descriptors, withholding them from the policy model while preserving the spatial layout of the remaining patches. The policy therefore receives different stochastic but complete views of the same scene: every random subset of patches acts as a different, yet still sensible and coherent, projection of the world. As a result, the policy bases its decisions on features that are invariant to which specific tokens survive. Extensive experiments confirm that, across all OOD scenarios, our method outperforms the state of the art (SOTA), achieving a $6.2$% average improvement and up to $20.4$% in closed-loop simulations, while being $2.4\times$ faster. We conduct ablations over masking rates and patch-feature reorganization, training and evaluating 9 systems, 8 of which surpass the prior SOTA. Finally, we show that the same learned policy transfers to a physical, real-world car without any tuning.
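The per-frame masking step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `stochastic_patch_selection`, the 8x8 patch grid, the 64-dimensional descriptors, and the 0.5 masking rate are all hypothetical choices made for the example. It also assumes that masked patches are dropped outright, and that the spatial layout is preserved by returning each surviving patch's grid coordinates.

```python
import numpy as np


def stochastic_patch_selection(features, mask_rate, rng):
    """Randomly keep a subset of patch descriptors for one frame.

    features:  array of shape (H, W, D), a grid of patch descriptors.
    mask_rate: fraction of patches to withhold from the policy, in [0, 1).
    rng:       a numpy.random.Generator.

    Returns (kept, positions): kept has shape (K, D) and positions has
    shape (K, 2), holding the (row, col) grid index of each kept patch.
    """
    h, w, d = features.shape
    n = h * w
    n_keep = max(1, int(round(n * (1.0 - mask_rate))))

    # Sample patch indices without replacement, then sort them so the
    # surviving patches stay in raster order (spatial layout preserved).
    keep = np.sort(rng.choice(n, size=n_keep, replace=False))

    kept = features.reshape(n, d)[keep]
    rows, cols = np.divmod(keep, w)
    positions = np.stack([rows, cols], axis=1)
    return kept, positions


# Hypothetical frame: an 8x8 grid of 64-dimensional descriptors.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 8, 64))
kept, positions = stochastic_patch_selection(features, mask_rate=0.5, rng=rng)
print(kept.shape, positions.shape)
```

Calling the function again on the same frame draws a different subset, so a policy trained this way sees a fresh random view of each scene at every step. The returned coordinates could be used to build positional encodings for the surviving tokens; how the paper's policy consumes them is not specified in the abstract.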
