Among the existing modalities for 3D action recognition, 3D flow has received little attention, even though it conveys rich motion cues for human actions. Presumably, its susceptibility to noise makes it hard to exploit and complicates learning in deep models. This work demonstrates the use of 3D flow sequences in a deep spatiotemporal model and further proposes an incremental two-level spatial attention mechanism, guided by the skeleton domain, that emphasizes motion features close to the body joints and weights them according to the informativeness of each joint. To this end, an extended deep skeleton model is also introduced to learn the most discriminative motion dynamics of an action and thereby estimate an informativeness score for each joint. A late fusion scheme between the two models is then adopted to learn high-level cross-modal correlations. Experimental results on NTU RGB+D, currently the largest and most challenging dataset, demonstrate the effectiveness of the proposed approach, which achieves state-of-the-art results.
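To make the idea of skeleton-guided spatial attention more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that joints are given as 2D pixel coordinates, that each joint already has a non-negative informativeness score, and that the attention map is built from Gaussian bumps around the joints. The function names `joint_attention_mask` and `apply_attention`, the `sigma` parameter, and the Gaussian form are all illustrative assumptions.

```python
import numpy as np

def joint_attention_mask(joints_xy, scores, height, width, sigma=8.0):
    """Build a (height, width) attention map from 2D joint positions.

    Hypothetical two-level scheme (an assumption, not the paper's exact design):
      level 1: a Gaussian bump emphasises pixels close to each joint;
      level 2: each bump is scaled by that joint's informativeness score.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    scores = np.asarray(scores, dtype=float)
    weights = scores / scores.sum()  # normalise scores so they sum to 1
    mask = np.zeros((height, width), dtype=float)
    for (x, y), w in zip(joints_xy, weights):
        bump = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        mask = np.maximum(mask, w * bump)  # keep the strongest response per pixel
    return mask / mask.max()  # rescale so the peak value is 1

def apply_attention(flow, mask):
    """Weight a (height, width, 3) flow frame by the attention map."""
    return flow * mask[..., None]  # broadcast the mask over the 3 flow channels

# Two illustrative joints; the first is assumed to be three times as informative.
mask = joint_attention_mask([(2, 2), (7, 7)], [3.0, 1.0], height=10, width=10, sigma=1.5)
attended = apply_attention(np.ones((10, 10, 3)), mask)
```

In this sketch the flow around the more informative joint passes through at full strength, while flow around the other joint is attenuated in proportion to its lower score and flow far from any joint is suppressed.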