Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating "believable" human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of actual human users. To address this gap, we introduce OPeRA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPeRA is the first public dataset that comprehensively captures user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPeRA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user's next action and rationale, given that user's persona and <observation, action, rationale> history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for humans.