Deploying Large Language Model-based agents (LLM agents) in the public sector requires assurance that they meet the stringent legal, procedural, and structural requirements of public-sector institutions. Practitioners and researchers often turn to benchmarks for such assessments. However, it remains unclear what criteria benchmarks must meet to adequately reflect public-sector requirements, or how many existing benchmarks do so. In this paper, we first define such criteria based on a first-principles survey of the public administration literature: benchmarks must be \emph{process-based}, \emph{realistic}, and \emph{public-sector-specific}, and must report \emph{metrics} that reflect the unique requirements of the public sector. We analyse more than 1,300 benchmark papers against these criteria using an expert-validated, LLM-assisted pipeline. Our results show that no single benchmark meets all of the criteria. Our findings are a call to action both for researchers to develop public-sector-relevant benchmarks and for public-sector officials to apply these criteria when evaluating their own agentic use cases.