Language models are remarkably capable at medical question answering, in some cases surpassing the accuracy of general physicians. However, answering questions about wearable health data remains challenging and understudied: these ubiquitous sensors produce continuous, high-dimensional, longitudinal data that is non-trivial to align with the text-centric distributions seen in LLM pretraining. Moreover, the diversity of sensor modalities and user intents cannot be handled effectively by a fixed reasoning workflow or a single pretrained foundation model. To address these challenges, we propose WEQA, a query-adaptive agent framework that unifies LLM reasoning with specialized wearable analytical and modeling tools. An LLM controller synthesizes execution plans, dynamically routes each query to the appropriate combination of sensor analysis and pretrained models, and audits responses against external knowledge. We also curate a benchmark spanning four open wearable datasets, comprising analytic and predictive tasks in three health domains. Experiments show that our framework is 24% more accurate than LLM and agentic baselines, and a blinded study with 12 medical experts and 8 users shows substantial gains in usefulness and clinical soundness.