We present a new self-supervised paradigm for point cloud sequence understanding. Inspired by discriminative and generative self-supervised methods, we design two tasks, point cloud sequence based Contrastive Prediction and Reconstruction (CPR), that jointly learn more comprehensive spatiotemporal representations. Specifically, dense point cloud segments are first fed into an encoder to extract embeddings. The embeddings of all but the last segment are then aggregated by a context-aware autoregressor to predict the last, target segment. To model multi-granularity structures, local and global contrastive learning is performed between predictions and targets. To further improve the generalization of the representations, the predictions are also used by a decoder to reconstruct the raw point cloud sequences, where point cloud colorization is employed to distinguish between different frames. Combining the classic contrastive and reconstruction paradigms gives the learned representations both global discrimination and local perception. We conduct experiments on four point cloud sequence benchmarks and report results on action recognition and gesture recognition under multiple experimental settings. The performance is comparable to that of supervised methods and shows strong transferability.
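The contrastive step between predictions and targets can be illustrated with a minimal sketch. The abstract does not state which loss is used, so this example assumes an InfoNCE-style objective with cosine similarity and a temperature; the names `cosine`, `info_nce`, and `temperature` are hypothetical and not taken from the paper. Each prediction is pulled toward its own target embedding, while the other targets in the batch serve as negatives. Only a single granularity is shown; the local/global split described above is not modeled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(predictions, targets, temperature=0.1):
    """Mean InfoNCE-style loss (an assumed objective, not the paper's stated one).

    predictions[i] is the positive match for targets[i]; every other
    target in the batch acts as a negative for that prediction.
    """
    total = 0.0
    for i, pred in enumerate(predictions):
        logits = [cosine(pred, tgt) / temperature for tgt in targets]
        # Numerically stable log-sum-exp over all candidate targets.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the correct target.
        total += log_denom - logits[i]
    return total / len(predictions)


# Toy embeddings: two target segments and two sets of predictions.
targets = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each prediction close to its own target
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each prediction close to the wrong target

loss_aligned = info_nce(aligned, targets)
loss_swapped = info_nce(swapped, targets)
print(loss_aligned < loss_swapped)
```

In this toy setup the aligned predictions yield a lower loss than the swapped ones, which is the behavior a contrastive prediction objective is meant to reward. In practice the embeddings would come from the encoder and autoregressor rather than hand-written lists.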