Can AI Reason Like an Urban Planner? Benchmarking Large Language Models Against Professional Judgment

Problem, Research Strategy, and Findings: The rise of large language models (LLMs) raises a key question for urban planning: which forms of professional planning knowledge can AI replicate, and which still require human judgment? Although AI tools are increasingly used in planning practice, there is still no systematic framework for testing whether they can reason with the contextual sensitivity, value awareness, and institutional literacy central to planning expertise. This paper introduces Urban Planning Bench (UPBench), a domain-specific evaluation framework that assesses LLM reasoning through a 4×5 matrix of four knowledge pillars and five cognitive levels adapted from Bloom's revised taxonomy. Evaluating 25 LLMs with automated scoring and expert review, we find a non-monotonic cognitive curve: models perform better on higher-order analytical tasks than on factual recall and integrative judgment. This suggests that planning knowledge often treated as lower-order is deeply shaped by institutional, jurisdictional, and temporal context, making it hard for LLMs to generalize. We summarize these limits as four epistemic diagnostics: regulatory hallucination, conceptual conflation, wickedness paralysis, and phronetic deficit.

Takeaway for Practice: The findings support differential delegation in planning. LLMs can assist with cross-disciplinary synthesis, literature review, scenario generation, and preliminary policy analysis. However, they remain unreliable for jurisdiction-specific regulation, normative conflict resolution, and context-sensitive procedure. Agencies should require verification for AI-assisted regulatory analysis, while planning education should emphasize institutional literacy, normative judgment, and contextual sensitivity.
