To support iterative assessment during a person's rehabilitation, automated evaluation of their abilities in daily activities requires temporally precise segmentation of fine-grained actions in therapy videos. Existing temporal action segmentation (TAS) models struggle to capture sub-second micro-movements while retaining exercise context, blurring rapid phase transitions and limiting reliable downstream assessment of motor recovery. We introduce Multi-Membership Temporal Attention (MMTA), a high-resolution temporal transformer for fine-grained rehabilitation assessment. Unlike standard temporal attention, which assigns each frame a single attention context per layer, MMTA lets each frame attend to multiple locally normalized temporal attention windows within the same layer. We fuse these concurrent temporal views via feature-space overlap resolution, preserving competing local contexts near transitions while enabling longer-range reasoning through layer-wise propagation. This increases boundary sensitivity without additional depth or multi-stage refinement. MMTA supports both video and wearable IMU inputs within a unified single-stage architecture, making it applicable to both clinical and home settings. MMTA consistently improves over the Global Attention transformer, boosting Edit Score by +1.3 (Video) and +1.6 (IMU) on StrokeRehab, and improves 50Salads by +3.3. Ablations confirm that the performance gains stem from multi-membership temporal views rather than architectural complexity, offering a practical solution for resource-constrained rehabilitation assessment.
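The core idea (each frame belonging to multiple overlapping, locally normalized attention windows, with overlaps resolved in feature space) can be illustrated with a minimal NumPy sketch. This is a hypothetical toy implementation, not the authors' architecture: the window size, stride, and the averaging-based overlap fusion are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax for local attention normalization.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_window_attention(X, window=8, stride=4):
    """Toy multi-membership attention sketch (NOT the paper's exact method).

    Each frame is a member of every overlapping window that covers it.
    Attention is normalized locally within each window, and the outputs
    of overlapping windows are fused here by simple averaging in
    feature space (a stand-in for the paper's overlap resolution).
    """
    T, d = X.shape
    out = np.zeros_like(X)
    counts = np.zeros((T, 1))
    for start in range(0, T - window + 1, stride):
        w = X[start:start + window]            # (window, d) local slice
        scores = w @ w.T / np.sqrt(d)          # local self-attention scores
        attn = softmax(scores, axis=-1)        # normalized per window
        out[start:start + window] += attn @ w  # accumulate this view
        counts[start:start + window] += 1      # membership count per frame
    return out / np.maximum(counts, 1)         # fuse overlapping views

rng = np.random.default_rng(0)
feats = rng.standard_normal((32, 16))          # 32 frames, 16-dim features
fused = multi_window_attention(feats)
print(fused.shape)  # (32, 16)
```

Frames near window boundaries receive contributions from several windows, which is the mechanism the abstract credits for preserving competing local contexts at phase transitions.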