Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training

The rapid advancement of deep learning models is often attributed to their ability to leverage massive training data. In contrast, 3D deep learning has not yet fully benefited from this advantage, mainly because large-scale 3D datasets are scarce. Merging multiple available data sources and letting them collaboratively train a single model is a potential solution. However, due to the large domain gap between 3D point cloud datasets, such mixed supervision can degrade the model's performance relative to single-dataset training (i.e., negative transfer). In view of this challenge, we introduce Point Prompt Training (PPT), a novel framework for multi-dataset synergistic learning in the context of 3D representation learning that supports multiple pre-training paradigms. Based on this framework, we propose Prompt-driven Normalization, which adapts the model to different datasets with domain-specific prompts, and Language-guided Categorical Alignment, which unifies the label spaces of multiple datasets by leveraging the relationships between label texts. Extensive experiments verify that PPT can overcome the negative transfer associated with synergistic learning and produce generalizable representations. Notably, it achieves state-of-the-art performance on each dataset using a single weight-shared model with supervised multi-dataset training. Moreover, when serving as a pre-training framework, it outperforms other pre-training approaches in representation quality and attains state-of-the-art performance on more than ten diverse downstream tasks spanning both indoor and outdoor 3D scenarios.
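To make the two components named in the abstract more concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes that Prompt-driven Normalization can be approximated by a shared normalization whose scale and shift are linear functions of a per-dataset prompt vector, and that Language-guided Categorical Alignment can be approximated by matching a point feature to the most similar label-text embedding among the labels of the current dataset. All names (`prompt_driven_norm`, `classify_by_text`, `w_scale`, `w_shift`), the prompt vectors, and the hand-written "text embeddings" are hypothetical; in practice the embeddings would come from a text encoder and the prompts and weights would be learned.

```python
import math


def normalize(features, eps=1e-5):
    """Standardize each channel across points (zero mean, unit variance)."""
    n, channels = len(features), len(features[0])
    out = [[0.0] * channels for _ in range(n)]
    for c in range(channels):
        col = [row[c] for row in features]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            out[i][c] = (features[i][c] - mean) / math.sqrt(var + eps)
    return out


def prompt_driven_norm(features, prompt, w_scale, w_shift):
    """Normalize, then apply a scale and shift computed from the dataset prompt.

    Assumption: scale = 1 + w_scale @ prompt and shift = w_shift @ prompt,
    with one row of weights per feature channel.
    """
    normed = normalize(features)
    channels = len(normed[0])
    scale = [1.0 + sum(p * w for p, w in zip(prompt, w_scale[c])) for c in range(channels)]
    shift = [sum(p * w for p, w in zip(prompt, w_shift[c])) for c in range(channels)]
    return [[row[c] * scale[c] + shift[c] for c in range(channels)] for row in normed]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_by_text(point_feature, text_embeddings, candidate_labels):
    """Return the candidate label whose text embedding best matches the feature."""
    return max(candidate_labels, key=lambda name: cosine(point_feature, text_embeddings[name]))


# Hypothetical prompts for two datasets (learned in a real system).
PROMPTS = {"dataset_a": [1.0, 0.0], "dataset_b": [0.0, 1.0]}

# Hypothetical weights: 2 feature channels x 2 prompt dimensions.
W_SCALE = [[0.5, -0.5], [0.0, 0.0]]
W_SHIFT = [[0.2, -0.2], [0.0, 0.0]]

# Three points with two feature channels each.
FEATURES = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]

out_a = prompt_driven_norm(FEATURES, PROMPTS["dataset_a"], W_SCALE, W_SHIFT)
out_b = prompt_driven_norm(FEATURES, PROMPTS["dataset_b"], W_SCALE, W_SHIFT)

# Hand-written stand-ins for label-text embeddings shared by all datasets.
TEXT_EMBEDDINGS = {"chair": [1.0, 0.0], "car": [0.0, 1.0]}

# The same classifier serves both datasets; only the candidate labels differ.
label_b = classify_by_text([0.2, 0.7], TEXT_EMBEDDINGS, ["chair", "car"])
label_a = classify_by_text([0.2, 0.7], TEXT_EMBEDDINGS, ["chair"])
```

In this sketch the backbone features and the text embedding table are shared across datasets, while only the small prompt vector and the candidate label list change per dataset. This is one way to read the abstract's claim of a single weight-shared model trained on several datasets.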
