A Comparative Study of Deep Learning Architectures for Multi-Horizon Behavioural Forecasting for Mobile Health

Wearable devices and smartphones generate rich behavioural time series that can support proactive health interventions, yet systematic comparisons of modern forecasting architectures on these data are lacking. In particular, it remains unclear how models generalise across populations, how different architectures respond to participant-level fine-tuning, and how forecasting accuracy degrades across multi-day horizons. We benchmark six deep learning architectures, two zero-shot foundation models (FMs) and statistical baselines on three public datasets encompassing over 800 participants, reporting per-feature metrics for step counts, screen time and sleep duration across horizons of 1 to 8 days. We further conduct a per-feature personalisation study across all six architectures and assess FM transferability across dataset sizes and temporal granularities. Our key findings are: (i) no single architecture dominates, although PatchTST leads among trained models and the three runners-up (TCN, MLP, Transformer) show no meaningful performance difference among themselves; (ii) the FM TimesFM matches or exceeds trained models zero-shot, especially in low-data regimes; and (iii) participant-level fine-tuning reduces per-feature RMSE by 16-60%, with sleep benefiting most and step counts least. These results provide practical guidance on architecture selection, FM applicability and personalisation strategies for mobile health forecasting. To the best of our knowledge, this is the first study to jointly evaluate modern deep learning, FMs and personalisation for multi-horizon behavioural forecasting from wearables.
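To make the reported quantities concrete, the following is a minimal sketch of how a per-feature, per-horizon RMSE and the relative RMSE reduction from personalisation could be computed. It is not the authors' evaluation code: the array layout `(n_samples, n_horizons, n_features)`, the function names, the feature labels and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

# Illustrative feature labels (assumed); the abstract names step counts,
# screen time and sleep duration as the forecast targets.
FEATURES = ("steps", "screen_time", "sleep_duration")


def rmse_per_feature_and_horizon(y_true, y_pred):
    """RMSE over samples, kept separate for each horizon and feature.

    Both inputs have the assumed shape (n_samples, n_horizons, n_features).
    Returns an array of shape (n_horizons, n_features).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))


def relative_rmse_reduction(rmse_before, rmse_after):
    """Percentage by which RMSE drops from rmse_before to rmse_after."""
    return 100.0 * (rmse_before - rmse_after) / rmse_before


# Synthetic demonstration only: no values here come from the study.
rng = np.random.default_rng(0)
y_true = rng.normal(size=(200, 8, len(FEATURES)))   # 8 daily horizons
noise = rng.normal(size=y_true.shape)
pred_global = y_true + noise            # stand-in for a population-level model
pred_personal = y_true + 0.5 * noise    # stand-in for a fine-tuned model

rmse_global = rmse_per_feature_and_horizon(y_true, pred_global)
rmse_personal = rmse_per_feature_and_horizon(y_true, pred_personal)
reduction = relative_rmse_reduction(rmse_global, rmse_personal)
```

Because the synthetic "personalised" errors are exactly half the "global" errors, `reduction` is 50% in every cell. On real forecasts the values would differ by feature and horizon, which is the kind of variation the 16-60% range summarises.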
