Scientists and practitioners are increasingly moving to deploy digital twins (LLM-based models of real individuals) across social science and policy research. We conduct 19 pre-registered studies spanning 164 diverse outcomes (e.g., attitudes toward hiring algorithms, intentions to share misinformation), comparing human responses to those of their corresponding digital twins, which are trained on each individual's prior responses to over 500 questions. We establish an empirical benchmark for digital twin performance: their predictions are only modestly more accurate than those of a homogeneous base LLM and exhibit weak correlation with human responses (average $r = 0.20$). To inform future development, we identify five systematic distortions in digital twin behavior: (i) insufficient individuation, (ii) stereotyping, (iii) representation bias, (iv) ideological bias, and (v) hyper-rationality. Finally, we release our full dataset and code as a standardized testbed for evaluating and improving digital twin methodologies. Together, our findings caution against premature deployment while laying the groundwork for a transparent, replicable, and iterative science of responsible digital twin development.