Large language models (LLMs) are increasingly used to simulate human behavior, but their ability to simulate individual privacy decisions is not well understood. In this paper, we evaluate whether a core set of user persona attributes can drive LLMs to simulate individual-level privacy behavior. We introduce PrivacySIM, an evaluation suite that benchmarks LLM simulation of user privacy behavior against the ground-truth responses of 1,000 users. These users are drawn from five published user studies on privacy spanning LLM healthcare consultations, conversational agents, and chatbots. Drawing on these user studies, we hypothesize three persona facets as plausible predictors of privacy decision-making: demographics, previous experiences, and stated privacy attitudes. We condition nine frontier LLMs on subsets of these three facets and measure how often each model's response to a data-sharing scenario matches the user's actual response. Our findings show that (1) privacy persona conditioning consistently improves simulation quality over no-persona conditioning, but even the strongest model (40.4% accuracy) remains far from faithfully simulating individual privacy decisions; (2) a user's stated privacy attitudes alone may not be the best predictor, because they often diverge from the user's actual privacy behavior; and (3) users with high AI/chatbot experience but low stated privacy attitudes are the most challenging to simulate. PrivacySIM is a first step toward understanding and improving the ability of LLMs to simulate user privacy decisions. We release PrivacySIM to enable further evaluation of LLM privacy simulation.
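The evaluation described above (conditioning on subsets of the three persona facets and counting how often the simulated response matches the user's actual response) can be illustrated with a minimal sketch. This is not the PrivacySIM implementation: the facet labels, the function names `facet_subsets` and `match_accuracy`, and the choice to enumerate every subset including the empty no-persona one are assumptions made here for illustration only.

```python
from itertools import combinations

# Hypothetical labels for the three persona facets named in the abstract.
FACETS = ("demographics", "previous_experiences", "stated_privacy_attitudes")


def facet_subsets(facets=FACETS):
    """Return every subset of the facets, smallest first.

    The empty tuple stands for the no-persona baseline (an assumption of this sketch).
    """
    return [
        combo
        for size in range(len(facets) + 1)
        for combo in combinations(facets, size)
    ]


def match_accuracy(simulated, actual):
    """Fraction of scenarios where the simulated response equals the user's response."""
    if len(simulated) != len(actual):
        raise ValueError("simulated and actual must have the same length")
    if not actual:
        raise ValueError("at least one response is required")
    matches = sum(s == a for s, a in zip(simulated, actual))
    return matches / len(actual)
```

Under these assumptions, one would call `match_accuracy` once per model and per facet subset, then compare the scores against the empty-subset baseline.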