We introduce Point-Bind, a 3D multi-modality model that aligns point clouds with 2D images, language, audio, and video. Guided by ImageBind, we construct a joint embedding space between 3D and the other modalities, enabling many promising applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D open-world understanding. On top of this, we further present Point-LLM, the first 3D large language model (LLM) that follows 3D multi-modal instructions. Using parameter-efficient fine-tuning techniques, Point-LLM injects the semantics of Point-Bind into pre-trained LLMs, e.g., LLaMA; it requires no 3D instruction data, yet exhibits superior 3D and multi-modal question-answering capacity. We hope our work sheds light for the community on extending 3D point clouds to multi-modality applications. Code is available at https://github.com/ZiyuGuo99/Point-Bind_Point-LLM.
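To make the idea of 3D embedding arithmetic in a joint embedding space concrete, the following is a minimal sketch and not the authors' implementation. It assumes that embeddings from different modalities already live in one shared space; the vectors are hand-written toy placeholders, and the helper names (`normalize`, `compose`, `retrieve`) are hypothetical. In practice the vectors would come from the trained 3D encoder and the ImageBind encoders.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def compose(*embeddings):
    """Sum unit-normalized embeddings, then re-normalize the sum."""
    unit = [normalize(e) for e in embeddings]
    summed = [sum(vals) for vals in zip(*unit)]
    return normalize(summed)


def retrieve(query, gallery):
    """Return gallery keys sorted by cosine similarity to the query, best first."""
    q = normalize(query)
    scores = {
        key: sum(a * b for a, b in zip(q, normalize(vec)))
        for key, vec in gallery.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy 3-dimensional placeholders standing in for real encoder outputs
# (assumption: all of them share one joint embedding space).
point_cloud_car = [1.0, 0.0, 0.0]   # hypothetical 3D embedding of a car
audio_waves = [0.0, 1.0, 0.0]       # hypothetical audio embedding of sea waves
image_gallery = {
    "car_on_road": [0.9, 0.1, 0.0],
    "car_by_sea": [0.7, 0.7, 0.0],
    "empty_beach": [0.1, 0.9, 0.2],
}

# Combine the 3D and audio semantics, then look up the closest image.
query = compose(point_cloud_car, audio_waves)
ranking = retrieve(query, image_gallery)
print(ranking)
```

With these toy values, the image that mixes both concepts ranks first because the composed query points between the two input directions. Re-normalizing after the sum keeps cosine similarity comparable across queries; this is a design choice of the sketch, not something the abstract specifies.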