Neural Radiance Fields (NeRFs) have emerged as a standard framework for representing 3D scenes and objects, introducing a novel data type for information exchange and storage. Concurrently, significant progress has been made in multimodal representation learning for text and image data. This paper explores a novel research direction that aims to connect the NeRF modality with other modalities, similar to established methodologies for images and text. To this end, we propose a simple framework that exploits pre-trained models for NeRF representations alongside multimodal models for text and image processing. Our framework learns a bidirectional mapping between NeRF embeddings and those obtained from corresponding images and text. This mapping unlocks several novel and useful applications, including NeRF zero-shot classification and NeRF retrieval from images or text.
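The bidirectional mapping idea can be illustrated with a minimal sketch. Everything below is hypothetical: the embedding dimensions, the synthetic paired data, and the use of a closed-form least-squares map are stand-ins for the paper's actual pre-trained NeRF encoder, multimodal model, and learned mapping network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical paired data for N objects:
#   nerf_emb — embeddings from a pre-trained NeRF encoder (dim 64, assumed)
#   clip_emb — embeddings of corresponding images/text from a CLIP-like
#              multimodal model (dim 32, assumed)
N, d_nerf, d_clip = 200, 64, 32
nerf_emb = rng.normal(size=(N, d_nerf))
W_true = rng.normal(size=(d_nerf, d_clip))
clip_emb = nerf_emb @ W_true + 0.01 * rng.normal(size=(N, d_clip))

# Bidirectional mapping between the two embedding spaces, fit here by
# least squares as a simple stand-in for a learned mapping network.
W_n2c, *_ = np.linalg.lstsq(nerf_emb, clip_emb, rcond=None)  # NeRF -> image/text
W_c2n, *_ = np.linalg.lstsq(clip_emb, nerf_emb, rcond=None)  # image/text -> NeRF

def retrieve(query_clip, nerf_bank, W):
    """Return the index of the NeRF whose mapped embedding is most
    cosine-similar to the image/text query embedding."""
    mapped = nerf_bank @ W                                   # NeRFs in CLIP space
    q = query_clip / np.linalg.norm(query_clip)
    m = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    return int(np.argmax(m @ q))

# Image/text-to-NeRF retrieval: querying with object 7's image/text
# embedding should recover NeRF 7 from the bank.
assert retrieve(clip_emb[7], nerf_emb, W_n2c) == 7
```

Zero-shot classification follows the same pattern in the other direction: map a NeRF embedding into the image/text space with `W_n2c` and compare it against the embeddings of class-name prompts.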