Synthetic data offers a compelling path to scalable pretraining when real-world data is scarce, but models pretrained on synthetic data often fail to transfer reliably to deployment settings. We study this problem in full-body human motion, where large-scale data collection is infeasible yet essential for wearable-based Human Activity Recognition (HAR), and where synthetic motion can be generated from motion-capture-derived representations. We pretrain motion time-series models on such synthetic data and evaluate their transfer across diverse downstream HAR tasks. Our results show that synthetic pretraining improves generalisation when it is mixed with real data or scaled sufficiently. We also show that large-scale motion-capture pretraining yields only marginal gains because of domain mismatch with wearable signals. Together, these findings clarify key sim-to-real challenges and delineate the limits and opportunities of synthetic motion data for learning transferable HAR representations.