Duoduo CLIP: Efficient 3D Understanding with Multi-View Images

We introduce Duoduo CLIP, a model for 3D representation learning that learns shape encodings from multi-view images instead of point clouds. Using multi-view images lets us leverage 2D priors from off-the-shelf CLIP models when fine-tuning on 3D data. Our approach not only generalizes better than existing point cloud methods, but also reduces GPU requirements and training time. In addition, we modify the model with cross-view attention to leverage information across multiple frames of the object, which further boosts performance. Whereas the current SOTA point cloud method requires 480 A100 hours to train 1 billion model parameters, we require only 57 A5000 hours and 87 million parameters. Multi-view images also offer more flexibility than point clouds: objects can be encoded from a variable number of images, with performance improving as more views are used. Point cloud based methods, in contrast, require an entire scan or model of the object. We showcase this flexibility with object retrieval from images of real-world objects. Our model also achieves better performance on more fine-grained text-to-shape retrieval, demonstrating better text-and-shape alignment than point cloud based models.
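
To make the variable-view idea concrete, the following is a minimal sketch of encoding an object from any number of per-view embeddings and retrieving it by cosine similarity. It is not the paper's implementation: the abstract describes cross-view attention, whereas this sketch substitutes plain mean pooling as a stand-in, and the function names (`encode_shape`, `retrieve`) and the 3-dimensional toy vectors are hypothetical. In practice the per-view embeddings would come from a CLIP image encoder.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def encode_shape(view_embeddings):
    """Pool any number of per-view embeddings into one unit-length shape embedding.

    Assumption: mean pooling stands in for the cross-view attention
    mentioned in the abstract; it is NOT the authors' method.
    """
    if not view_embeddings:
        raise ValueError("need at least one view")
    dim = len(view_embeddings[0])
    count = len(view_embeddings)
    mean = [sum(v[i] for v in view_embeddings) / count for i in range(dim)]
    return l2_normalize(mean)


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_embedding, database):
    """Return database keys ranked by similarity to the query, best first."""
    q = l2_normalize(query_embedding)
    return sorted(database, key=lambda name: cosine(q, database[name]), reverse=True)


# Toy per-view embeddings (hypothetical values, not real CLIP outputs).
chair_views = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.1]]  # three views
table_views = [[0.1, 0.9, 0.2]]                                    # a single view

database = {
    "chair": encode_shape(chair_views),
    "table": encode_shape(table_views),
}

# A query embedding, e.g. from a text prompt or a photo of a real object.
ranking = retrieve([0.7, 0.2, 0.0], database)
print(ranking)
```

Because pooling happens over a list of views, the same `encode_shape` call works for one image or many, which is the flexibility the abstract contrasts with point cloud methods that need a full scan.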

