The rapid advancement of large language models (LLMs) has sparked growing interest in their integration into autonomous systems for reasoning-driven perception, planning, and decision-making. However, evaluating and training such agentic AI models remains challenging due to the lack of large-scale, structured, and safety-critical benchmarks. This paper introduces AgentDrive, an open benchmark dataset containing 300,000 LLM-generated driving scenarios designed for training, fine-tuning, and evaluating autonomous agents under diverse conditions. AgentDrive formalizes a factorized scenario space across seven orthogonal axes: scenario type, driver behavior, environment, road layout, objective, difficulty, and traffic density. An LLM-driven prompt-to-JSON pipeline generates semantically rich, simulation-ready specifications that are validated against physical and schema constraints. Each scenario undergoes simulation rollouts, surrogate safety metric computation, and rule-based outcome labeling. To complement simulation-based evaluation, we introduce AgentDrive-MCQ, a 100,000-question multiple-choice benchmark spanning five reasoning dimensions: physics, policy, hybrid, scenario, and comparative reasoning. We conduct a large-scale evaluation of fifty leading LLMs on AgentDrive-MCQ. Results show that while proprietary frontier models perform best in contextual and policy reasoning, advanced open models are rapidly closing the gap in structured and physics-grounded reasoning. We release the AgentDrive dataset, AgentDrive-MCQ benchmark, evaluation code, and related materials at https://github.com/maferrag/AgentDrive
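To make the factorized scenario space concrete, the sketch below shows what a seven-axis scenario record and its schema check might look like. This is an illustrative assumption, not the actual AgentDrive schema: the axis names follow the abstract, but the allowed values (`"merge"`, `"aggressive"`, etc.) and the `validate_scenario` helper are hypothetical.

```python
# Illustrative sketch (not the released AgentDrive schema): a scenario
# factorized over the paper's seven orthogonal axes, plus a minimal
# schema-validity check of the kind the prompt-to-JSON pipeline implies.

# Hypothetical value sets; the paper defines the axes, not these values.
ALLOWED = {
    "scenario_type": {"merge", "intersection", "highway", "roundabout"},
    "driver_behavior": {"cautious", "normal", "aggressive"},
    "environment": {"clear", "rain", "fog", "night"},
    "road_layout": {"straight", "curved", "t_junction"},
    "objective": {"reach_goal", "safe_follow", "overtake"},
    "difficulty": {"easy", "medium", "hard"},
    "traffic_density": {"low", "medium", "high"},
}

def validate_scenario(spec: dict) -> list:
    """Return a list of schema violations; an empty list means valid."""
    errors = []
    for axis, allowed in ALLOWED.items():
        value = spec.get(axis)
        if value is None:
            errors.append("missing axis: " + axis)
        elif value not in allowed:
            errors.append("invalid %s: %r" % (axis, value))
    # Physical-consistency rules (e.g. speed limits vs. road layout)
    # would plug in here before a scenario is declared simulation-ready.
    return errors

example = {
    "scenario_type": "merge",
    "driver_behavior": "aggressive",
    "environment": "rain",
    "road_layout": "curved",
    "objective": "safe_follow",
    "difficulty": "hard",
    "traffic_density": "high",
}
assert validate_scenario(example) == []
```

In this reading, each LLM-generated JSON specification is accepted only if every axis is present and takes a recognized value, after which it proceeds to simulation rollouts and outcome labeling.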