Deep learning has become the dominant paradigm in Wearable Human Activity Recognition (WHAR), yet progress is obscured by a comparability crisis. Results are often reported using inconsistent datasets, custom data processing, and varying evaluation protocols, making state-of-the-art claims fragile. We address this with a large-scale, open-source benchmark that integrates 30 diverse datasets under standardized processing, unified model interfaces, and a shared cross-subject evaluation protocol. Evaluating 17 representative architectures across 4,760 training runs, we measure predictive performance together with on-device latency, peak memory, and model size on an Android reference device. Our results reveal that the WHAR state of the art is distributed rather than dominated by a single architecture. While CNN-HAR achieves the highest mean macro-F1, the top-performing models cluster tightly, indicating that contemporary architectures have converged near a predictive performance ceiling. When deployment efficiency is taken into account, compact neural models such as TinierHAR and classical Random Forests define the practically relevant Pareto frontier, whereas larger recurrent and hybrid models incur high hardware costs without corresponding performance gains. Consequently, although predictive performance has plateaued, substantial room for progress remains in optimizing deployment efficiency and improving adaptation to domain shifts. We release our full framework to support transparent reuse and extension.
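To make the notion of a Pareto frontier over predictive performance and deployment cost concrete, the following is a minimal sketch, not the benchmark's own implementation. It assumes only two axes (macro-F1, where higher is better, and latency, where lower is better). The `Candidate` type, the function names, the model labels, and all numeric values are hypothetical placeholders chosen for illustration; none are taken from the reported results.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    macro_f1: float    # higher is better
    latency_ms: float  # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.macro_f1 >= b.macro_f1 and a.latency_ms <= b.latency_ms
    strictly_better = a.macro_f1 > b.macro_f1 or a.latency_ms < b.latency_ms
    return no_worse and strictly_better


def pareto_frontier(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Hypothetical placeholder values, NOT results from the benchmark.
candidates = [
    Candidate("model_a", macro_f1=0.80, latency_ms=2.0),
    Candidate("model_b", macro_f1=0.82, latency_ms=15.0),
    Candidate("model_c", macro_f1=0.81, latency_ms=40.0),
    Candidate("model_d", macro_f1=0.70, latency_ms=1.0),
]

frontier = pareto_frontier(candidates)
frontier_names = [c.name for c in frontier]
print(frontier_names)
```

In this toy input, `model_c` is dropped because `model_b` is both more accurate and faster, while the other three each win on at least one axis. A fuller analysis would presumably add peak memory and model size as further axes, which only requires extending the two comparisons in `dominates`.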