Benchmarking LLMs for Community Governance Simulation with Life-history Narratives

Effective community governance hinges on understanding what specific residents think and need. Recent work has used large language models (LLMs) to simulate human respondents, offering a scalable, reproducible, and low-cost way to study human attitudes and behaviors. However, these studies typically prompt the model with only a few demographic variables (age, gender, income) and therefore simulate only general role types. This is insufficient for community governance, where decisions depend on the views of specific residents. We bridge this gap with an integrated research framework covering dataset, benchmark, algorithm, and system. The dataset comprises approximately 1.2 million characters of first-person narrative, collected through two-hour semi-structured interviews with each of 92 residents of an urban community and organized around nine community-governance domains. The benchmark probes 18 mainstream LLMs across four prompting strategies and shows that adding rich life-history profiles meaningfully raises fidelity above the no-profile baseline, but that this gain costs more input tokens per call because the profiles lengthen the prompt. The algorithm, curriculum-LoRA, is a parameter-efficient personalization framework that closes this fidelity-cost gap: it matches the strongest baseline's fidelity at roughly 10x lower per-call cost and Pareto-dominates every other configuration tested. The system integrates curriculum-LoRA into a closed-loop policy-evaluation pipeline. Together, these results bring individual-level LLM-based resident simulation within reach of resource-constrained local administrations, enabling community-governance decisions to be systematically pre-evaluated in silico before real-world deployment.

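The Pareto-dominance claim in the abstract can be made concrete with a minimal sketch. It assumes each configuration is summarized by two numbers, a fidelity score (higher is better) and a per-call cost (lower is better). The configuration names and values below are hypothetical placeholders chosen only to mirror the roughly 10x cost ratio; they are not results from the benchmark.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    fidelity: float  # higher is better (hypothetical scale)
    cost: float      # per-call cost, lower is better (hypothetical units)


def dominates(a: Config, b: Config) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.fidelity >= b.fidelity and a.cost <= b.cost
    strictly_better = a.fidelity > b.fidelity or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Hypothetical placeholder values, NOT numbers reported by the paper.
configs = [
    Config("no_profile_prompt", fidelity=0.50, cost=1.0),
    Config("full_profile_prompt", fidelity=0.70, cost=10.0),
    Config("personalized_adapter", fidelity=0.70, cost=1.0),
]

front = pareto_front(configs)
print([c.name for c in front])
```

Under these placeholder values the adapter-style configuration matches the long-prompt configuration's fidelity at one tenth of the cost, so it is the only point left on the front.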