Evidence for Phenotype-Driven Disparities in Freezing of Gait Detection and Approaches to Bias Mitigation

Freezing of gait (FOG) is a debilitating symptom of Parkinson's disease (PD) and a common cause of injurious falls. Recent advances in wearable-based human activity recognition (HAR) enable FOG detection, but bias and fairness in these models remain understudied. Bias refers to systematic errors that lead to unequal outcomes, while fairness refers to consistent performance across subject groups. Biased models could systematically underserve patients with specific FOG phenotypes or demographic characteristics, potentially widening disparities in care. We systematically evaluated the bias and fairness of state-of-the-art HAR models for FOG detection across phenotypes and demographics using multi-site datasets. We assessed four mitigation approaches: two conventional methods (threshold optimization and adversarial debiasing) and two transfer learning approaches (multi-site transfer and fine-tuning of large pretrained models). Fairness was quantified using the demographic parity ratio (DPR) and the equalized odds ratio (EOR). HAR models exhibited substantial bias (DPR and EOR < 0.8) across age, sex, disease duration, and, critically, FOG phenotype. Phenotype-specific bias is particularly concerning because tremulous and akinetic FOG require different clinical management. Conventional bias mitigation methods failed: threshold optimization traded one fairness metric against the other (ΔDPR = -0.126, ΔEOR = +0.063), and adversarial debiasing left both essentially unchanged (ΔDPR = -0.008, ΔEOR = -0.001). In contrast, transfer learning from multi-site datasets significantly improved both fairness (ΔDPR = +0.037, p < 0.01; ΔEOR = +0.045, p < 0.01) and detection performance (ΔF1-score = +0.020, p < 0.05). Transfer learning across diverse datasets is therefore essential for developing equitable HAR models that reliably detect FOG across all patient phenotypes, ensuring that wearable-based monitoring benefits all individuals with PD.
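The two fairness metrics named above can be made concrete with a short sketch. It assumes binary, window-level FOG predictions (1 = FOG) and one sensitive attribute per window (for example, phenotype), and it uses the standard definitions: DPR is the lowest group-level positive-prediction rate divided by the highest, and EOR is the smaller of the between-group true-positive-rate ratio and false-positive-rate ratio (the same definitions documented for Fairlearn's `demographic_parity_ratio` and `equalized_odds_ratio`). Whether the study computed the metrics exactly this way is an assumption; the function names and the toy data are hypothetical.

```python
from collections import defaultdict
from collections.abc import Hashable, Sequence


def _min_max_ratio(rates: list[float]) -> float:
    """Smallest group-level rate divided by the largest (1.0 means perfectly equal)."""
    # Convention assumed here: if no group has a defined rate, or all rates
    # are zero, the groups are treated as equal.
    if not rates or max(rates) == 0:
        return 1.0
    return min(rates) / max(rates)


def demographic_parity_ratio(y_pred: Sequence[int], groups: Sequence[Hashable]) -> float:
    """Hypothetical helper. DPR: min/max over groups of the rate of FOG predictions."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for pred, group in zip(y_pred, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return _min_max_ratio([positives[g] / totals[g] for g in totals])


def equalized_odds_ratio(
    y_true: Sequence[int], y_pred: Sequence[int], groups: Sequence[Hashable]
) -> float:
    """Hypothetical helper. EOR: smaller of the between-group TPR ratio and FPR ratio."""
    # counts[group][true_label] = [windows predicted as FOG, total windows]
    counts = defaultdict(lambda: {0: [0, 0], 1: [0, 0]})
    for label, pred, group in zip(y_true, y_pred, groups):
        cell = counts[group][label]
        cell[0] += int(pred == 1)
        cell[1] += 1
    # TPR uses windows whose true label is FOG; FPR uses non-FOG windows.
    # Groups with no windows of a given label are skipped for that rate.
    tprs = [c[1][0] / c[1][1] for c in counts.values() if c[1][1] > 0]
    fprs = [c[0][0] / c[0][1] for c in counts.values() if c[0][1] > 0]
    return min(_min_max_ratio(tprs), _min_max_ratio(fprs))


if __name__ == "__main__":
    # Hypothetical toy data: 4 windows per phenotype group.
    y_true = [1, 1, 0, 0, 1, 1, 0, 0]
    y_pred = [1, 1, 1, 0, 1, 0, 1, 0]
    groups = ["akinetic"] * 4 + ["tremulous"] * 4

    dpr = demographic_parity_ratio(y_pred, groups)
    eor = equalized_odds_ratio(y_true, y_pred, groups)
    biased = dpr < 0.8 or eor < 0.8  # 0.8 cut-off as used in the abstract
    print(f"DPR={dpr:.3f}, EOR={eor:.3f}, biased={biased}")
    # prints: DPR=0.667, EOR=0.500, biased=True
```

Expressing both metrics as a minimum-over-maximum ratio keeps them on a 0 to 1 scale, so a single cut-off such as 0.8 can be applied to every attribute (age, sex, disease duration, phenotype), and the effect of a mitigation method can be read as the change in each ratio relative to the baseline model.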
