Passively collected behavioral health data from ubiquitous sensors holds significant promise to provide mental health professionals with insights from patients' daily lives; however, developing analysis tools to use this data in clinical practice requires addressing challenges of generalization across devices and weak or ambiguous correlations between the measured signals and an individual's mental health. To address these challenges, we take a novel approach that leverages large language models (LLMs) to synthesize clinically useful insights from multi-sensor data. We develop chain-of-thought prompting methods that use LLMs to generate reasoning about how trends in data such as step count and sleep relate to conditions like depression and anxiety. We first demonstrate binary depression classification with LLMs, achieving an accuracy of 61.1%, which exceeds the state of the art. While this accuracy is not robust enough for clinical use, it leads us to our key finding: even more impactful and valued than classification is a new human-AI collaboration approach in which clinician experts interactively query these tools and combine their domain expertise and context about the patient with AI-generated reasoning to support clinical decision-making. We find that models like GPT-4 correctly reference numerical data 75% of the time, and clinician participants express strong interest in using this approach to interpret self-tracking data.