WS-IMUBench: Can Weakly Supervised Methods from Audio, Image, and Video Be Adapted for IMU-based Temporal Action Localization?

IMU-based Human Activity Recognition (HAR) has enabled a wide range of ubiquitous computing applications, yet its dominant clip-classification paradigm cannot capture the rich temporal structure of real-world behaviors. This motivates a shift toward IMU Temporal Action Localization (IMU-TAL), which predicts both action categories and their start/end times in continuous streams. However, current progress is strongly bottlenecked by the need for dense, frame-level boundary annotations, which are costly and difficult to scale. To address this bottleneck, we introduce WS-IMUBench, a systematic benchmark study of weakly supervised IMU-TAL (WS-IMU-TAL) using only sequence-level labels. Rather than proposing a new localization algorithm, we evaluate how effectively established weakly supervised localization paradigms from audio, image, and video transfer to IMU-TAL. We benchmark seven representative weakly supervised methods on seven public IMU datasets, resulting in over 3,540 model training runs and 7,080 inference evaluations. Guided by three research questions on transferability, effectiveness, and insights, our findings show that (i) transfer is modality-dependent, with temporal-domain methods generally more stable than image-derived proposal-based approaches; (ii) weak supervision can be competitive on favorable datasets (e.g., those with longer actions and higher-dimensional sensing); and (iii) dominant failure modes arise from short actions, temporal ambiguity, and poor proposal quality. Finally, we outline concrete directions for advancing WS-IMU-TAL (e.g., IMU-specific proposal generation, boundary-aware objectives, and stronger temporal reasoning). Beyond individual results, WS-IMUBench establishes a reproducible benchmarking template, comprising datasets, protocols, and analyses, to accelerate community-wide progress toward scalable WS-IMU-TAL.
