Continuous glucose monitors (CGMs) used in diabetes care collect rich personal health data that could improve day-to-day self-management. However, current patient platforms offer only static summaries that do not support exploratory user queries. Large language models (LLMs) could enable free-form questions about continuous glucose data, but deploying them over sensitive health records raises privacy and accuracy concerns. In this paper, we present CGM-Agent, a privacy-preserving framework for question answering over personal glucose data. In our design, the LLM serves purely as a reasoning engine that selects analytical functions; all computation occurs locally, and personal health data never leaves the user's device. For evaluation, we construct a benchmark of 4,180 questions that combines parameterized question templates with real user queries, with ground truth derived from deterministic program execution. Evaluating six leading LLMs, we find that the top models achieve 94\% value accuracy on synthetic queries and 88\% on ambiguous real-world queries. Errors stem primarily from intent and temporal ambiguity rather than from computational failures. Additionally, lightweight models achieve competitive performance in our agent design, suggesting opportunities for low-cost deployment. We release our code and benchmark to support future work on trustworthy health agents.
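The separation described in the abstract, where the model only selects an analytical function and the computation runs locally, can be illustrated with a minimal sketch. This is not the CGM-Agent implementation: the function names (`mean_glucose`, `time_in_range`), the plan format, and the 70 to 180 mg/dL default range are assumptions made for illustration, and the plan that an LLM would produce is hard-coded here.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical analytical functions; names and signatures are illustrative
# assumptions, not the CGM-Agent API. Readings are (timestamp, mg/dL) pairs.
def _window(readings, start, end):
    """Return glucose values with start <= timestamp < end."""
    return [value for ts, value in readings if start <= ts < end]

def mean_glucose(readings, start, end):
    """Average glucose over the window."""
    return mean(_window(readings, start, end))

def time_in_range(readings, start, end, low=70, high=180):
    """Percentage of readings in the window that fall within [low, high]."""
    values = _window(readings, start, end)
    in_range = sum(1 for v in values if low <= v <= high)
    return 100.0 * in_range / len(values)

# Registry of functions the model is allowed to select.
TOOLS = {"mean_glucose": mean_glucose, "time_in_range": time_in_range}

def execute_plan(plan, readings):
    """Run the selected function locally; readings are never sent to the model."""
    name = plan["function"]
    if name not in TOOLS:
        raise ValueError(f"unknown function: {name}")
    args = dict(plan["args"])
    start = datetime.fromisoformat(args.pop("start"))
    end = datetime.fromisoformat(args.pop("end"))
    return TOOLS[name](readings, start, end, **args)

# Made-up example data, one reading every 30 minutes.
base = datetime(2024, 1, 1, 8, 0)
readings = [(base + timedelta(minutes=30 * i), v)
            for i, v in enumerate([100, 150, 200, 250])]

# Stand-in for the model's output. In a real system the LLM would produce this
# from the user's question and the function descriptions, without seeing data.
plan = {"function": "time_in_range",
        "args": {"start": "2024-01-01T08:00:00", "end": "2024-01-01T10:00:00"}}

result = execute_plan(plan, readings)
```

Because the answer comes from deterministic code rather than from the model's own arithmetic, the same mechanism could plausibly produce ground truth for a benchmark, in the spirit of the deterministic program execution the abstract mentions.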