The rapid advancement of large language models (LLMs) has accelerated progress toward universal AI assistants. However, existing benchmarks for personalized assistants remain misaligned with real-world user-assistant interactions, failing to capture the complexity of external contexts and users' cognitive states. To bridge this gap, we propose LifeSim, a user simulator that models user cognition through the Belief-Desire-Intention (BDI) framework within physical environments to generate coherent life trajectories, and simulates intention-driven user interaction behaviors. Based on LifeSim, we introduce LifeSim-Eval, a comprehensive benchmark for multi-scenario, long-horizon personalized assistance. LifeSim-Eval covers 8 life domains and 1,200 diverse scenarios, and adopts a multi-turn interactive protocol to assess models' abilities to fulfill explicit and implicit intentions, recover user profiles, and produce high-quality responses. Under both single-scenario and long-horizon settings, our experiments reveal that current LLMs face significant limitations in handling implicit intentions and modeling long-term user preferences.
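The BDI loop underlying the user simulator can be illustrated with a minimal sketch. All class and field names below are hypothetical, chosen only to show how beliefs are updated from the environment, desires are filtered into intentions, and intentions drive utterances toward the assistant; they are not LifeSim's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class BDIUser:
    """Toy BDI-style user agent (illustrative names, not LifeSim's implementation)."""
    beliefs: dict = field(default_factory=dict)      # user's view of the environment
    desires: list = field(default_factory=list)      # candidate goals
    intentions: list = field(default_factory=list)   # desires committed to action

    def perceive(self, observation: dict) -> None:
        # Update beliefs from the simulated physical environment.
        self.beliefs.update(observation)

    def deliberate(self) -> None:
        # Commit only to desires whose precondition holds under current beliefs.
        self.intentions = [d for d in self.desires
                           if self.beliefs.get(d["requires"], False)]

    def act(self) -> list:
        # Turn committed intentions into utterances directed at the assistant.
        return [f"Help me {i['goal']}" for i in self.intentions]


user = BDIUser(desires=[{"goal": "plan a healthy dinner", "requires": "at_home"}])
user.perceive({"at_home": True})
user.deliberate()
print(user.act())  # ['Help me plan a healthy dinner']
```

In this reading, implicit intentions correspond to desires the user never states outright, which the assistant must infer from the trajectory of beliefs and behaviors rather than from an explicit request.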