Foundation models for point cloud data have recently grown in capability, often leveraging extensive representation learning from language or vision. In this work, we take a more controlled approach and introduce a lightweight transformer-based point cloud architecture. In contrast to approaches that rely heavily on cross-modal supervision, our model is trained on only 39k point clouds, yet it outperforms several larger foundation models trained on more than 200k samples. Interestingly, our method approaches state-of-the-art results from models that have seen over a million point clouds, images, and text samples, demonstrating the value of a carefully curated training setup and architecture. To ensure rigorous evaluation, we conduct a comprehensive replication study that standardizes the training regime and benchmarks across multiple point cloud architectures. This unified experimental framework isolates the impact of architectural choices, enabling transparent comparisons and highlighting the benefits of our design and of other tokenizer-free architectures. Our results show that simple backbones can deliver results competitive with more complex or data-rich strategies. The implementation, including code, pre-trained models, and training protocols, is available at https://github.com/KonradSzafer/Pointy.