LLM-Powered Social Digital Twins: A Framework for Simulating Population Behavioral Response to Policy Interventions

Predicting how populations respond to policy interventions is a fundamental challenge in computational social science and public policy. Traditional approaches rely on aggregate statistical models that capture historical correlations but lack mechanistic interpretability and struggle with novel policy scenarios. We present a general framework for constructing Social Digital Twins: virtual population replicas in which Large Language Models (LLMs) serve as cognitive engines for individual agents. Each agent, characterized by demographic and psychographic attributes, receives policy signals and outputs a multi-dimensional behavioral probability vector. A calibration layer maps aggregated agent responses to observable population-level metrics, enabling validation against real-world data and deployment for counterfactual policy analysis. We instantiate this framework in the domain of pandemic response, using COVID-19 as a case study with rich observational data. On a held-out test period, our calibrated digital twin achieves a 20.7% improvement in macro-averaged prediction error over gradient boosting baselines across six behavioral categories. In counterfactual experiments, predicted behavior responds monotonically and within bounds to policy variations, which supports the behavioral plausibility of the model. The framework is domain-agnostic: the same architecture applies to transportation policy, economic interventions, environmental regulations, or any setting where policy affects population behavior. We discuss implications for policy simulation, limitations of the approach, and directions for extending LLM-based digital twins beyond pandemic response.
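To make the pipeline described above concrete, the following is a minimal sketch of its stages: agent, policy signal, probability vector, aggregation, calibration, and a macro-averaged error. It is not the authors' implementation. Every name in it is hypothetical, including the category labels, the two agent attributes, and the scalar `stringency` policy signal. The LLM call is replaced by a deterministic stub (`stub_engine`), and the calibration layer is assumed to be a per-category affine map fitted by ordinary least squares, which the abstract does not specify.

```python
from dataclasses import dataclass
from statistics import mean
import math

# Hypothetical labels; the abstract only states that there are six categories.
CATEGORIES = ["cat_0", "cat_1", "cat_2", "cat_3", "cat_4", "cat_5"]


@dataclass
class Agent:
    age: int               # illustrative demographic attribute
    risk_aversion: float   # illustrative psychographic attribute in [0, 1]


def stub_engine(agent: Agent, stringency: float) -> list[float]:
    """Stand-in for the LLM cognitive engine (assumption, not the paper's model).

    Returns one probability per behavioral category. A real system would
    prompt an LLM with the agent profile and the policy signal instead.
    """
    probs = []
    for k in range(len(CATEGORIES)):
        logit = 2.0 * stringency + agent.risk_aversion + 0.01 * agent.age - 1.0 - 0.2 * k
        probs.append(1.0 / (1.0 + math.exp(-logit)))  # logistic keeps values in (0, 1)
    return probs


def aggregate(agents: list[Agent], stringency: float) -> list[float]:
    """Average agent-level probability vectors into one population-level vector."""
    responses = [stub_engine(a, stringency) for a in agents]
    return [mean(r[k] for r in responses) for k in range(len(CATEGORIES))]


def fit_affine(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Assumed calibration form: least-squares fit of y ~= a * x + b."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return a, my - a * mx


def macro_mae(pred: list[list[float]], obs: list[list[float]]) -> float:
    """Mean absolute error per category, then averaged across categories."""
    per_cat = [
        mean(abs(p[k] - o[k]) for p, o in zip(pred, obs))
        for k in range(len(CATEGORIES))
    ]
    return mean(per_cat)


def is_monotonic_and_bounded(sweep: list[list[float]]) -> bool:
    """Check that each category is non-decreasing across the sweep and stays in [0, 1]."""
    for k in range(len(CATEGORIES)):
        series = [vec[k] for vec in sweep]
        if any(not 0.0 <= v <= 1.0 for v in series):
            return False
        if any(later < earlier for earlier, later in zip(series, series[1:])):
            return False
    return True


# Counterfactual sweep over the hypothetical policy signal.
agents = [Agent(age=25, risk_aversion=0.2), Agent(age=60, risk_aversion=0.8)]
sweep = [aggregate(agents, s) for s in (0.0, 0.5, 1.0)]
plausible = is_monotonic_and_bounded(sweep)
```

In this sketch the monotonicity check passes by construction, because the stub is increasing in `stringency`. With an actual LLM engine the same check would be an empirical test of the kind the counterfactual experiments describe.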
