In an era marked by the rapid scaling of foundation models, autonomous driving is approaching a transformative threshold, and end-to-end autonomous driving (E2E-AD) has emerged because of its potential to scale in a data-driven manner. However, most existing E2E-AD methods are evaluated in an open-loop, log-replay setting with L2 error and collision rate as metrics (e.g., on nuScenes), which, as the community has recently acknowledged, cannot fully reflect an algorithm's driving performance. E2E-AD methods that are evaluated under a closed-loop protocol are tested on fixed routes (e.g., Town05Long and Longest6 in CARLA) with the driving score as the metric, which is known to have high variance because the metric function is not smooth and long routes introduce substantial randomness. In addition, these methods usually collect their own training data, which makes fair algorithm-level comparison infeasible. To meet the need for comprehensive, realistic, and fair testing environments for Full Self-Driving (FSD), we present Bench2Drive, the first benchmark for evaluating the multiple abilities of E2E-AD systems in a closed-loop manner. Bench2Drive's official training data consists of 2 million fully annotated frames, collected from 13,638 short clips uniformly distributed across 44 interactive scenarios (cut-in, overtaking, detour, etc.), 23 weather conditions (sunny, foggy, rainy, etc.), and 12 towns (urban, village, university, etc.) in CARLA v2. Its evaluation protocol requires E2E-AD models to pass the 44 interactive scenarios under different locations and weather conditions, for a total of 220 routes, and thus provides a comprehensive and disentangled assessment of their driving capability in different situations. We implement state-of-the-art E2E-AD models and evaluate them on Bench2Drive, providing insights into the current status and future directions of the field.
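To make the "disentangled" assessment concrete, the following is a minimal sketch of how per-scenario results could be aggregated, not the benchmark's actual evaluation code. It assumes each route yields a single pass/fail outcome tagged with its scenario name; the function `per_scenario_pass_rate` and the `(scenario, passed)` record format are hypothetical. The figure of 5 routes per scenario is only what 220 / 44 works out to if the routes are spread evenly, which the abstract does not state explicitly.

```python
from collections import defaultdict

# Figures taken from the abstract.
NUM_SCENARIOS = 44
NUM_ROUTES = 220

# Assumption: routes are split evenly across scenarios (220 / 44).
ROUTES_PER_SCENARIO = NUM_ROUTES // NUM_SCENARIOS


def per_scenario_pass_rate(results):
    """Group route outcomes by scenario and return the pass rate of each.

    `results` is an iterable of (scenario_name, passed) pairs, one per route.
    This record format is hypothetical and used only for illustration.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for scenario, ok in results:
        total[scenario] += 1
        passed[scenario] += int(ok)
    return {scenario: passed[scenario] / total[scenario] for scenario in total}


if __name__ == "__main__":
    demo = [("cut-in", True), ("cut-in", False), ("overtaking", True)]
    print(ROUTES_PER_SCENARIO)
    print(per_scenario_pass_rate(demo))
```

Reporting one number per scenario, rather than a single score over one long route, is what lets weaknesses in a specific skill (for example, cut-in handling) show up separately from the others.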