Autonomous driving is full of small social negotiations: one driver presses forward, another yields, a pedestrian feints toward the curb, or a vehicle in the lane decides whether to open a merge gap. Such interactions require inferring hidden intent from behavior under partial observability and then acting safely and efficiently. Existing autonomous-driving language benchmarks mostly focus on perception, visual question answering, or open-loop planning, while existing language-agent negotiation benchmarks typically make the negotiation explicit in text. Self-Driving Negotiator bridges this gap: it is a text-only, multi-turn, procedurally generated environment for measuring implicit social coordination in driving. Agents output concrete driving actions, and reward and diagnostics are computed from privileged simulator state, not from the model's explanation. This report covers task design, reward and anti-gaming invariants, validated scenarios, non-LLM baselines, and a six-model inference leaderboard. Current models fall well short of the scripted expert. The best average success rate across three scenarios is 0.68; contested merge is statistically flat across models; and difficulty tiers separate cue-following from true wait-for-commitment behavior.