A growing body of research assumes that large language model (LLM) agents can serve as proxies for how people form attitudes toward, and behave in response to, security and privacy (S&P) threats. If correct, such simulations could offer a scalable way to forecast S&P risks in products before deployment. We interrogate this assumption with SP-ABCBench, a new benchmark of 30 tests derived from validated S&P human-subjects studies. It measures alignment between simulations and human-subjects results on a 0-100 scale (higher is better) across three dimensions: Attitude, Behavior, and Coherence. Evaluating twelve LLMs, four persona-construction strategies, and two prompting methods, we find substantial room for improvement: all models average between 50 and 64. Newer, larger, and more capable models do not reliably do better and sometimes do worse. Some simulation configurations, however, do yield high alignment: for example, scores above 95 on some behavior tests when agents are prompted to apply bounded rationality and weigh privacy costs against perceived benefits. We release SP-ABCBench to enable reproducible evaluation as methods improve.