Behavioral healthcare risk assessment remains a challenging problem due to the highly multimodal nature of patient data and the temporal dynamics of mood and affective disorders. While large language models (LLMs) have demonstrated strong reasoning capabilities, their effectiveness in structured clinical risk scoring remains unclear. In this work, we introduce HARBOR, a behavioral-health-aware language model designed to predict a discrete mood and risk score, termed the Harbor Risk Score (HRS), on an integer scale from -3 (severe depression) to +3 (mania). We also release PEARL, a longitudinal behavioral healthcare dataset spanning four years of monthly observations from three patients, containing physiological, behavioral, and self-reported mental health signals. We benchmark traditional machine learning models, proprietary LLMs, and HARBOR across multiple evaluation settings and ablations. Our results show that HARBOR outperforms both classical baselines and off-the-shelf LLMs, achieving 69 percent accuracy compared to 54 percent for logistic regression and 29 percent for the strongest proprietary LLM baseline.