Learning view-invariant representations is key to improving the discriminative power of features for skeleton-based action recognition. Existing approaches cannot effectively remove the impact of viewpoint because their representations remain implicitly view-dependent. In this work, we propose a self-supervised framework called Focalized Contrastive View-invariant Learning (FoCoViL), which significantly suppresses view-specific information in a representation space where the viewpoints are coarsely aligned. By maximizing mutual information with an effective contrastive loss between multi-view sample pairs, FoCoViL associates actions that share common view-invariant properties and simultaneously separates dissimilar ones. We further propose an adaptive focalization method based on pairwise similarity that enhances contrastive learning and yields clearer cluster boundaries in the learned space. Unlike many existing self-supervised representation learning methods that rely heavily on supervised classifiers, FoCoViL achieves superior recognition performance with both unsupervised and supervised classifiers. Extensive experiments also show that the proposed contrastive-based focalization generates a more discriminative latent representation.
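To make the idea of a focalized contrastive loss over multi-view pairs concrete, the following is a minimal sketch and not the authors' formulation. It assumes an InfoNCE-style objective in which the same action seen from two viewpoints forms the positive pair and other actions in the batch act as negatives, and it borrows a focal-loss-style weight `(1 - p_pos) ** gamma` as a stand-in for the similarity-based adaptive focalization. The function name `focalized_contrastive_loss` and the parameters `temperature` and `gamma` are hypothetical.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def focalized_contrastive_loss(view_a, view_b, temperature=0.1, gamma=2.0):
    """Illustrative focal-weighted InfoNCE loss (an assumption, not the paper's loss).

    view_a[i] and view_b[i] are embeddings of the same action from two
    viewpoints (positive pair); view_b[j] with j != i serve as negatives.
    """
    a = [_normalize(v) for v in view_a]
    b = [_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Temperature-scaled cosine similarity to every sample in the other view.
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature
                  for other in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        log_p_pos = logits[i] - log_denom
        p_pos = min(1.0, math.exp(log_p_pos))
        # Hypothetical focal weight: well-matched pairs (p_pos near 1) are
        # down-weighted so training concentrates on hard pairs.
        total += -((1.0 - p_pos) ** gamma) * log_p_pos
    return total / len(a)
```

Under these assumptions, setting `gamma=0.0` recovers the unweighted contrastive loss, while larger values shift emphasis toward pairs whose views are still poorly aligned in the embedding space.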