Personalized digital health support requires long-horizon, cross-dimensional reasoning over heterogeneous lifestyle signals, and recent advances in mobile sensing and large language models (LLMs) make such support increasingly feasible. However, the capabilities of current LLMs in this setting remain unclear due to the lack of systematic benchmarks. In this paper, we introduce LifeAgentBench, a large-scale QA benchmark for long-horizon, cross-dimensional, and multi-user lifestyle health reasoning, containing 22,573 questions spanning basic retrieval to complex reasoning. We release an extensible benchmark construction pipeline and a standardized evaluation protocol to enable reliable and scalable assessment of LLM-based health assistants. We then systematically evaluate 11 leading LLMs on LifeAgentBench and identify key bottlenecks in long-horizon aggregation and cross-dimensional reasoning. Motivated by these findings, we propose LifeAgent, a strong baseline health-assistant agent that integrates multi-step evidence retrieval with deterministic aggregation, achieving significant improvements over two widely used baselines. Case studies further demonstrate its potential in realistic daily-life scenarios. The benchmark is publicly available at https://anonymous.4open.science/r/LifeAgentBench-CE7B.