Synthetic Personalities: How Well Can LLMs Mimic Individual Respondents Using Socio-Economic Microdata?

LLM-based digital twins promise to scale and accelerate market research, but most published twins are either coarse persona bots conditioned on a few demographic questions or detailed individual-level twins built on purpose-collected surveys and interview transcripts. Neither setup speaks to the operationally most relevant case for marketing practice: building detailed individual twins from the pre-existing heterogeneous panel data that firms already accumulate through CRM systems, loyalty programs, and repeat surveys. We construct detailed individual-level twins from the German Socio-Economic Panel (SOEP) and evaluate them across a $3 \times 5 \times 2 \times 2$ construction-method grid that covers three open-weight LLMs, five cumulative information depths ranked by normalized Shannon entropy, two embedding methods, and two reasoning modes, scoring over 2.1 million twin responses on 500 participants and 183 held-out questions. Twin quality rises with information depth, but returns diminish past the 75 percent entropy quartile, which acts as a cost-efficient Pareto point relative to the best-performing 100 percent cells. Switching the embedding from a narrative persona summary to a raw dialog history of past responses raises held-out accuracy in every model-by-reasoning cell at the 100 percent depth, while an explicit thinking mode raises rank-order correlation without moving accuracy. Best-cell accuracy reaches 78.8 percent and Fisher-$z$ correlation reaches $r = 0.590$ on the SOEP held-out evaluation set. The findings suggest that twin-based market research is no longer gated by data design, but by item volume, model selection, and a small set of construction-level decisions that this paper now maps.
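The two summary statistics named above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the choice to normalize entropy by the logarithm of the number of answer options, and the use of a Fisher-$z$ transform to average per-respondent correlations are all assumptions made for illustration.

```python
import math
from collections import Counter


def normalized_entropy(responses, n_options=None):
    """Shannon entropy of an item's answer distribution, scaled to [0, 1].

    Assumption: the entropy is divided by log(k), where k is the number of
    answer options (defaulting to the number of distinct observed answers).
    """
    counts = Counter(responses)
    k = n_options if n_options is not None else len(counts)
    if k < 2:
        return 0.0  # an item with a single possible answer carries no information
    n = len(responses)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(k)


def fisher_z_mean(correlations):
    """Average correlations in Fisher-z space and map the mean back to r.

    Assumption: every correlation lies strictly between -1 and 1.
    """
    zs = [math.atanh(r) for r in correlations]
    return math.tanh(sum(zs) / len(zs))


def rank_items_by_entropy(item_responses):
    """Sort item names from highest to lowest normalized entropy.

    `item_responses` is a hypothetical mapping from item name to the list of
    answers observed for that item across respondents.
    """
    return sorted(item_responses,
                  key=lambda name: normalized_entropy(item_responses[name]),
                  reverse=True)
```

Under these assumptions, an item answered identically by all respondents scores 0 and an item with evenly split answers scores 1, so a ranking by this score would place the most heterogeneous items first. Averaging in Fisher-$z$ space weights high correlations more than an arithmetic mean does: `fisher_z_mean([0.0, 0.8])` yields approximately 0.5 rather than 0.4.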
