We introduce Duoduo CLIP, a model for 3D representation learning that learns shape encodings from multi-view images instead of point clouds. The choice of multi-view images allows us to leverage 2D priors from off-the-shelf CLIP models when fine-tuning on 3D data. Our approach not only generalizes better than existing point cloud methods, but also reduces GPU requirements and training time. In addition, we modify the model with cross-view attention to leverage information across multiple frames of the object, which further boosts performance. The current state-of-the-art point cloud method requires 480 A100 hours to train 1 billion parameters, whereas ours requires only 57 A5000 hours and 87 million parameters. Multi-view images also offer more flexibility in use cases than point clouds: objects can be encoded from a variable number of images, with better performance when more views are used. In contrast, point cloud-based methods require an entire scan or model of an object. We showcase this flexibility with object retrieval from images of real-world objects. Our model also achieves better performance on more fine-grained text-to-shape retrieval, demonstrating better text-and-shape alignment than point cloud-based models.
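To make the variable-view property concrete, the following is a minimal sketch of how a shape embedding could be formed from any number of per-view embeddings and then used for retrieval. It is an illustration under stated assumptions, not the model's architecture: simple mean pooling stands in for the cross-view attention described above, the input vectors are toy values rather than CLIP outputs, and the names `pool_views` and `retrieve` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def pool_views(view_embeddings: Sequence[Sequence[float]]) -> Vector:
    """Average any number of per-view embeddings into one unit-length vector.

    Assumption: mean pooling is a stand-in chosen for illustration; the
    abstract only states that cross-view attention is used.
    """
    if not view_embeddings:
        raise ValueError("need at least one view")
    dim = len(view_embeddings[0])
    if any(len(v) != dim for v in view_embeddings):
        raise ValueError("all view embeddings must have the same dimension")
    n = len(view_embeddings)
    mean = [sum(v[i] for v in view_embeddings) / n for i in range(dim)]
    return l2_normalize(mean)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve(query: Sequence[float], shapes: Dict[str, Vector]) -> List[str]:
    """Rank shape ids by cosine similarity to the query, best match first."""
    scores = {name: cosine_similarity(query, emb) for name, emb in shapes.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)


if __name__ == "__main__":
    # Toy 3-dimensional embeddings; a real encoder would produce these.
    chair = pool_views([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]])  # two views
    table = pool_views([[0.0, 1.0, 0.0]])                    # a single view
    print(retrieve([0.9, 0.1, 0.0], {"chair": chair, "table": table}))
```

Because the pooled vector has the same dimension regardless of how many views are supplied, the same retrieval code works for one image or many, which is the flexibility the abstract contrasts with methods that need a full scan of the object.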