Robustness Is a Function, Not a Number: A Factorized Comprehensive Study of OOD Robustness in Vision-Based Driving

Out-of-distribution (OOD) robustness in autonomous driving is often reduced to a single number, which hides what actually breaks a policy. We decompose environments along five axes: scene (rural/urban), season, weather, time of day (day/night), and agent mix. We then measure performance under controlled $k$-factor perturbations ($k \in \{0,1,2,3\}$), i.e., with $k$ axes changed simultaneously relative to the in-distribution (ID) training environment. Using closed-loop control in VISTA, we benchmark fully connected (FC), CNN, and ViT policies, train compact ViT heads on frozen foundation-model (FM) features, and vary the ID training support in scale, diversity, and temporal context. Our main findings are as follows. (1) ViT policies are markedly more OOD-robust than comparably sized CNN/FC policies, and FM features yield state-of-the-art success at a latency cost. (2) Naive temporal (multi-frame) inputs do not beat the best single-frame baseline. (3) The largest single-factor drops are rural $\rightarrow$ urban and day $\rightarrow$ night ($\sim 31\%$ each); actor swaps cost $\sim 10\%$ and moderate rain $\sim 7\%$; season shifts can be drastic, and combining a time flip with other changes degrades performance further. (4) FM-feature policies stay above $85\%$ under three simultaneous changes; non-FM single-frame policies take a large hit at the first shift, and all non-FM models fall below $50\%$ by three changes. (5) Interactions are non-additive: some pairings partially offset each other, whereas season-time combinations are especially harmful. (6) Training on winter/snow is most robust to single-factor shifts, while a rural+summer baseline gives the best overall OOD performance. (7) Scaling the number of traces/views improves robustness ($+11.8$ points from $5$ to $14$ traces), yet targeted exposure to hard conditions can substitute for scale. (8) Training on multiple ID environments broadens coverage and strengthens weak cases (urban OOD $60.6\% \rightarrow 70.1\%$) with a small ID drop; a single ID environment preserves peak performance but only in a narrow domain. These results yield actionable design rules for OOD-robust driving policies.
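
As an illustration of the factorized protocol, the following minimal Python sketch enumerates every environment that differs from a baseline on exactly $k$ axes. The five axis names come from the abstract; the two levels per axis, the `BASELINE` choice, and the helper names `k_factor_perturbations` and `interaction_excess` are assumptions made for this example and are not taken from the paper or from VISTA. The interaction measure shown is one common way to quantify non-additivity, not necessarily the one the study uses.

```python
from itertools import combinations, product

# Hypothetical factor levels: the axes follow the abstract, but the
# concrete level names (two per axis) are assumptions for illustration.
FACTORS = {
    "scene": ["rural", "urban"],
    "season": ["summer", "winter"],
    "weather": ["dry", "rain"],
    "time": ["day", "night"],
    "agents": ["default", "swapped"],
}

# Assumed in-distribution (ID) training environment.
BASELINE = {
    "scene": "rural",
    "season": "summer",
    "weather": "dry",
    "time": "day",
    "agents": "default",
}


def k_factor_perturbations(baseline, k, factors=FACTORS):
    """Return all environments that differ from `baseline` on exactly k axes."""
    envs = []
    for axes in combinations(sorted(factors), k):
        # For each chosen axis, list every level other than the baseline one.
        alternatives = [
            [level for level in factors[axis] if level != baseline[axis]]
            for axis in axes
        ]
        for levels in product(*alternatives):
            env = dict(baseline)
            env.update(zip(axes, levels))
            envs.append(env)
    return envs


def interaction_excess(drop_a, drop_b, drop_ab):
    """Joint drop minus the sum of single-factor drops.

    Zero means the two shifts combine additively; a positive value means
    the pair is worse than additive, a negative value means partial offset.
    """
    return drop_ab - (drop_a + drop_b)
```

With two levels per axis, `k_factor_perturbations(BASELINE, k)` yields $\binom{5}{k}$ environments, so the evaluation grid stays small enough to cover exhaustively for $k \le 3$.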
