Point cloud video understanding is critical for robotics because it accurately encodes motion and scene interaction. 4D datasets are far scarcer than 3D ones, which hampers the scalability of self-supervised 4D models. A promising alternative is to transfer 3D pre-trained models to 4D perception tasks. However, rigorous empirical analysis reveals two critical limitations that impede this transfer: overfitting and the modality gap. To overcome these challenges, we develop a novel "Align then Adapt" (PointATA) paradigm that decomposes parameter-efficient transfer learning into two sequential stages. In Stage 1, optimal-transport theory is employed to quantify the distributional discrepancy between 3D and 4D datasets, and our proposed point align embedder is trained to alleviate the underlying modality gap. In Stage 2, to mitigate overfitting, an efficient point-video adapter and a spatial-context encoder are integrated into the frozen 3D backbone to enhance its temporal modeling capacity. Notably, with these engineering-oriented designs, PointATA enables a pre-trained 3D model without temporal knowledge to reason about dynamic video content at a smaller parameter cost than previous work. Extensive experiments show that PointATA can match or even outperform strong full fine-tuning models while retaining the advantage of parameter efficiency, e.g., 97.21\% accuracy on 3D action recognition, $+8.7\%$ on 4D action segmentation, and 84.06\% on 4D semantic segmentation.
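To make the optimal-transport step more concrete, the following is a minimal sketch of how a distributional discrepancy between two feature sets might be measured. It is not the authors' implementation: the abstract does not say which solver or cost PointATA uses, so the entropy-regularised Sinkhorn iteration, the squared Euclidean cost, the uniform weights, and the names `sinkhorn_discrepancy`, `feats_3d`, and `feats_4d` are all assumptions made for illustration.

```python
import numpy as np


def sinkhorn_discrepancy(feats_3d, feats_4d, eps=0.1, n_iters=200):
    """Entropy-regularised OT cost between two feature sets (illustrative only).

    feats_3d: array of shape (n, d), hypothetical features from a 3D dataset.
    feats_4d: array of shape (m, d), hypothetical features from a 4D dataset.
    eps:      regularisation strength, relative to the largest pairwise cost.
    """
    n, m = feats_3d.shape[0], feats_4d.shape[0]

    # Pairwise squared Euclidean cost matrix, shape (n, m).
    diff = feats_3d[:, None, :] - feats_4d[None, :, :]
    cost = (diff ** 2).sum(axis=-1)

    scale = cost.max()
    if scale == 0.0:
        # All samples coincide, so nothing needs to be transported.
        return 0.0

    # Uniform weights on both sets (an assumption of this sketch).
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)

    # Gibbs kernel; dividing by the largest cost keeps exp() from underflowing.
    kernel = np.exp(-cost / (eps * scale))

    # Sinkhorn iterations: alternately rescale rows and columns of the kernel
    # so that the resulting plan matches the marginals a and b.
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)

    # Transport plan and the cost of moving mass along it.
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum())
```

Under these assumptions, a larger returned value indicates that the two feature distributions lie further apart, which is the kind of signal an alignment stage could try to reduce. For production use, a dedicated optimal-transport library would likely be preferable to this hand-rolled loop.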