Multi-modal human action segmentation is a critical and challenging task with a wide range of applications. Most current approaches concentrate on fusing dense signals (i.e., RGB, optical flow, and depth maps). However, the potential contribution of sparse IoT sensor signals, which can be crucial for accurate recognition, has not been fully explored. To address this gap, we introduce a Sparse signal-guided Transformer (SigFormer) that combines dense and sparse signals. We employ mask attention to fuse localized features by constraining cross-attention to the regions where sparse signals are valid. However, because sparse signals are discrete, they lack sufficient information about temporal action boundaries. Therefore, SigFormer emphasizes boundary information at two stages to alleviate this problem. In the first stage, feature extraction, we introduce an intermediate bottleneck module that jointly learns both category and boundary features of each dense modality through inner loss functions. After fusing the dense modalities with the sparse signals, we devise a two-branch architecture that explicitly models the interrelationship between action category and temporal boundary. Experimental results demonstrate that SigFormer outperforms state-of-the-art approaches on a multi-modal action segmentation dataset from real industrial environments, reaching an F1 score of 0.958. The code and pre-trained models are available at https://github.com/LIUQI-creat/SigFormer.
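To make the idea of constraining cross-attention to valid sparse-signal regions concrete, the following is a minimal single-head sketch in plain Python. It is not the SigFormer implementation: the function name `masked_cross_attention`, the choice of dense features as queries and sparse-signal features as keys and values, the boolean `valid` mask, and the zero-output fallback when no signal is valid are all assumptions made for illustration.

```python
import math


def masked_cross_attention(queries, keys, values, valid):
    """Single-head cross-attention restricted to valid key positions.

    queries: list of T_q vectors (assumed dense-modality features, length d)
    keys, values: lists of T_k vectors (assumed sparse-signal features)
    valid: list of T_k booleans, True where the sparse signal is valid
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        # Scaled dot-product scores, computed only for valid key positions.
        scores = {
            j: scale * sum(qi * ki for qi, ki in zip(q, k))
            for j, k in enumerate(keys)
            if valid[j]
        }
        if not scores:
            # Assumption: with no valid signal, the fused feature is zero.
            outputs.append([0.0] * out_dim)
            continue
        # Numerically stable softmax over the valid positions only.
        m = max(scores.values())
        exps = {j: math.exp(s - m) for j, s in scores.items()}
        z = sum(exps.values())
        out = [0.0] * out_dim
        for j, e in exps.items():
            w = e / z
            for c, v in enumerate(values[j]):
                out[c] += w * v
        outputs.append(out)
    return outputs


# Toy example: the third sparse-signal position is invalid and is ignored.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
valid = [True, True, False]
fused = masked_cross_attention(queries, keys, values, valid)
```

In this toy example each fused vector is a convex combination of the two valid value vectors only, so the invalid position contributes nothing regardless of how well its key matches the query.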