Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating ``believable'' human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPERA is the first public dataset that comprehensively captures user personas, browser observations, fine-grained web actions, and self-reported, just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPERA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user's next action and rationale given a persona and an <observation, action, rationale> history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for humans.