Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving

In an era of rapidly scaling foundation models, autonomous driving is approaching a transformative threshold: end-to-end autonomous driving (E2E-AD) is emerging because of its potential to scale up in a data-driven manner. However, existing E2E-AD methods are mostly evaluated in an open-loop, log-replay manner with L2 error and collision rate as metrics (e.g., on nuScenes), which, as the community has recently acknowledged, cannot fully reflect an algorithm's driving performance. E2E-AD methods that are evaluated under a closed-loop protocol are tested on fixed routes (e.g., Town05Long and Longest6 in CARLA) with the driving score as the metric, which is known for high variance due to the unsmoothed metric function and the large randomness of long routes. Moreover, these methods usually collect their own training data, which makes fair algorithm-level comparison infeasible. To meet the pressing need for comprehensive, realistic, and fair testing environments for Full Self-Driving (FSD), we present Bench2Drive, the first benchmark for evaluating the multiple abilities of E2E-AD systems in a closed-loop manner. Bench2Drive's official training data consists of 2 million fully annotated frames, collected from 10,000 short clips uniformly distributed across 44 interactive scenarios (cut-in, overtaking, detour, etc.), 23 weather conditions (sunny, foggy, rainy, etc.), and 12 towns (urban, village, university, etc.) in CARLA v2. Its evaluation protocol requires E2E-AD models to pass 44 interactive scenarios under different locations and weather conditions, which sums to 220 routes and thus provides a comprehensive and disentangled assessment of their driving capability under different situations. We implement state-of-the-art E2E-AD models and evaluate them in Bench2Drive, providing insights into the current status and future directions.
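To make the idea of a disentangled assessment concrete, the following is a minimal sketch of how per-scenario results over the 220 routes might be aggregated. It is not the benchmark's actual evaluation code: the function names, the `(scenario, passed)` record format, and the synthetic pass/fail pattern are hypothetical, and the even split of 5 routes per scenario is only inferred from 220 / 44, which the text above does not state explicitly.

```python
from collections import defaultdict

NUM_SCENARIOS = 44   # interactive scenarios named in the abstract
NUM_ROUTES = 220     # total evaluation routes named in the abstract
# Assumption: routes are split evenly across scenarios (220 / 44 = 5).
ROUTES_PER_SCENARIO = NUM_ROUTES // NUM_SCENARIOS


def per_scenario_success(results):
    """Return {scenario: success rate} from (scenario, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for scenario, passed in results:
        totals[scenario] += 1
        passes[scenario] += int(passed)
    return {s: passes[s] / totals[s] for s in totals}


def overall_success(results):
    """Return the fraction of all routes that were passed."""
    results = list(results)
    return sum(int(passed) for _, passed in results) / len(results)


# Synthetic, purely illustrative outcomes (not real benchmark results):
# route j of scenario i is marked as passed when (i + j) is even.
demo_results = [
    (f"scenario_{i:02d}", (i + j) % 2 == 0)
    for i in range(NUM_SCENARIOS)
    for j in range(ROUTES_PER_SCENARIO)
]

if __name__ == "__main__":
    rates = per_scenario_success(demo_results)
    print(len(demo_results), "routes,", len(rates), "scenarios")
    print("overall success rate:", overall_success(demo_results))
```

Reporting one rate per scenario, rather than a single score over a long route, is what lets a weakness in, say, cut-in handling show up separately from a weakness in overtaking.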
