Benchmarking LLMs for Community Governance Simulation with Life-history Narratives

Effective community governance hinges on understanding what specific residents think and need. Recent work has used large language models (LLMs) to simulate human respondents, offering a scalable, reproducible, and low-cost way to study human attitudes and behaviors. However, these studies typically prompt the model with only a few demographic variables (age, gender, income) and therefore simulate only generic role types. This is insufficient for community governance, where decisions depend on the views of specific residents. We bridge this gap with an integrated research framework covering dataset, benchmark, algorithm, and system. The dataset comprises approximately 1.2 million characters of first-person narrative collected through two-hour semi-structured interviews with each of 92 residents of an urban community, organized around nine community-governance domains. The benchmark probes 18 mainstream LLMs across four prompting strategies and shows that adding rich life-history profiles meaningfully raises fidelity above the no-profile baseline, but that this gain comes at the cost of longer prompts and therefore more input tokens per call. The algorithm, curriculum-LoRA, is a parameter-efficient personalization framework that closes this fidelity-cost gap: it matches the strongest baseline's fidelity at roughly 10x lower per-call cost and Pareto-dominates every configuration tested. The system integrates curriculum-LoRA into a closed-loop policy-evaluation pipeline. Together, these results bring individual-level LLM-based resident simulation within reach of resource-constrained local administrations, enabling community-governance decisions to be systematically pre-evaluated in silico before real-world deployment.
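To make the Pareto-dominance claim concrete, the following minimal sketch shows how dominance over (fidelity, per-call cost) pairs can be checked. It is an illustration under stated assumptions, not the evaluation code behind the benchmark: the `Config` type, the function names, the configuration names, and all numeric values are hypothetical placeholders and are not results reported here.

```python
from typing import NamedTuple


class Config(NamedTuple):
    """One simulated-respondent configuration (hypothetical structure)."""
    name: str
    fidelity: float  # higher is better
    cost: float      # per-call cost in arbitrary units, lower is better


def dominates(a: Config, b: Config) -> bool:
    """Return True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.fidelity >= b.fidelity and a.cost <= b.cost
    strictly_better = a.fidelity > b.fidelity or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(configs: list[Config]) -> list[Config]:
    """Keep only the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Placeholder values chosen only to illustrate the definition.
# They are NOT measurements from the benchmark.
configs = [
    Config("no_profile_prompt", fidelity=0.50, cost=1.0),
    Config("full_profile_prompt", fidelity=0.70, cost=10.0),
    Config("hypothetical_adapter", fidelity=0.70, cost=1.0),
]

front = pareto_front(configs)
print([c.name for c in front])
```

In this toy setting the adapter-style configuration matches the fidelity of the long-prompt configuration at a lower cost, so it dominates both alternatives and is the only point left on the front. A configuration that merely traded fidelity for cost would instead share the front with the others.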