The scarcity of annotated data in LiDAR point cloud understanding hinders effective representation learning. Consequently, researchers have been actively investigating self-supervised pre-training paradigms. Nevertheless, the temporal information inherent in LiDAR point cloud sequences is consistently disregarded. To better exploit this property, we propose an effective pre-training strategy, Temporal Masked Auto-Encoders (T-MAE), which takes temporally adjacent frames as input and learns temporal dependency. A SiamWCA backbone, containing a Siamese encoder and a windowed cross-attention (WCA) module, is established for the two-frame input. Because the movement of the ego-vehicle alters the view of the same instance, temporal modeling also serves as a robust and natural data augmentation, enhancing the comprehension of target objects. SiamWCA is a powerful architecture but relies heavily on annotated data; our T-MAE pre-training strategy alleviates this demand. Comprehensive experiments demonstrate that T-MAE achieves the best performance on both the Waymo and ONCE datasets among competitive self-supervised approaches.
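
To make the windowed cross-attention idea concrete, the following is a minimal, illustrative sketch and not the SiamWCA implementation. It assumes tokens with 2D positions, square non-overlapping windows, identity query/key/value projections (no learned weights), and a residual connection. The names `window_id` and `windowed_cross_attention` and the `window_size` parameter are hypothetical. In a Siamese setup, `cur_feat` and `prev_feat` would come from the same encoder applied to the current and previous frames.

```python
import math
from collections import defaultdict


def window_id(xy, window_size):
    """Map a 2D token position to the index of the square window containing it."""
    return (int(xy[0] // window_size), int(xy[1] // window_size))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def windowed_cross_attention(cur_pos, cur_feat, prev_pos, prev_feat, window_size=2.0):
    """Each current-frame token attends only to previous-frame tokens in its own window.

    Assumption: identity projections stand in for learned Q/K/V weights.
    """
    # Group previous-frame token indices by the window they fall into.
    buckets = defaultdict(list)
    for j, p in enumerate(prev_pos):
        buckets[window_id(p, window_size)].append(j)

    out = []
    for pos, q in zip(cur_pos, cur_feat):
        keys = buckets.get(window_id(pos, window_size), [])
        if not keys:
            # No previous-frame tokens share this window: pass the feature through.
            out.append(list(q))
            continue
        scale = math.sqrt(len(q))
        # Scaled dot-product scores against the previous-frame tokens in the window.
        scores = [sum(a * b for a, b in zip(q, prev_feat[j])) / scale for j in keys]
        weights = softmax(scores)
        attended = [
            sum(w * prev_feat[j][d] for w, j in zip(weights, keys))
            for d in range(len(q))
        ]
        # Residual connection: add the attended previous-frame context to the query.
        out.append([a + b for a, b in zip(q, attended)])
    return out
```

Restricting attention to tokens in the same window keeps the cost proportional to the window occupancy rather than to the full previous frame, which is the usual motivation for windowed attention on sparse point cloud features.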