Large Language Models (LLMs) are increasingly integrated into real-world decision-making, including in the domain of public policy. Yet their ability to comprehend and reason about policy-related content remains underexplored. To fill this gap, we present \textbf{\textit{PolicyBench}}, the first large-scale cross-system (US-China) benchmark for evaluating policy comprehension. It comprises 21K cases across a broad spectrum of policy areas, capturing the diversity and complexity of real-world governance. Following Bloom's taxonomy, the benchmark assesses three core capabilities: (1) \textbf{Memorization}: factual recall of policy knowledge, (2) \textbf{Understanding}: conceptual and contextual reasoning, and (3) \textbf{Application}: problem-solving in real-life policy scenarios. Building on this benchmark, we further propose \textbf{\textit{PolicyMoE}}, a domain-specialized Mixture-of-Experts (MoE) model with expert modules aligned to each cognitive level. The proposed model demonstrates stronger performance on application-oriented policy tasks than on memorization or conceptual understanding, and yields the highest accuracy on structured reasoning tasks. Our results reveal key limitations of current LLMs in policy understanding and suggest paths toward more reliable, policy-focused models.