We introduce JointNet, a novel neural network architecture for modeling the joint distribution of images and an additional dense modality (e.g., depth maps). JointNet extends a pre-trained text-to-image diffusion model: a copy of the original network is created for the new dense-modality branch and is densely connected with the RGB branch. The RGB branch is locked during fine-tuning, which enables efficient learning of the new modality's distribution while preserving the strong generalization ability of the large-scale pre-trained diffusion model. Using RGBD diffusion as an example, we demonstrate the effectiveness of JointNet through extensive experiments and showcase its use in a variety of applications, including joint RGBD generation, dense depth prediction, depth-conditioned image generation, and coherent tile-based 3D panorama generation.
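The branch construction described in the abstract (copy the pre-trained network, connect the copy to the RGB branch, lock the RGB branch) can be illustrated with a toy, framework-free sketch. Everything below is an assumption made for illustration, not the paper's implementation: the names `Layer`, `ZeroLink`, and `JointSketch` are hypothetical, layers are scalar affine maps rather than diffusion U-Net blocks, the cross-branch connections are modeled as additive links in both directions, and the links start at zero so that the joint model initially reproduces the pre-trained branch.

```python
import copy


class Layer:
    """Toy stand-in for a network block: y = w * x + b, element-wise."""

    def __init__(self, w, b):
        self.w, self.b = w, b
        self.trainable = True

    def __call__(self, x):
        return [self.w * v + self.b for v in x]


class ZeroLink:
    """Hypothetical cross-branch connection, initialised to zero
    (an assumption) so it contributes nothing before fine-tuning."""

    def __init__(self):
        self.scale = 0.0
        self.trainable = True

    def __call__(self, x):
        return [self.scale * v for v in x]


class JointSketch:
    """Two-branch toy model: a locked RGB branch plus a trainable copy."""

    def __init__(self, pretrained):
        # Copy the pre-trained layers for the new dense-modality branch.
        self.dense = copy.deepcopy(pretrained)
        # Lock the original (RGB) branch; the copy stays trainable.
        self.rgb = pretrained
        for layer in self.rgb:
            layer.trainable = False
        # One link per layer and direction (assumed form of "dense" connection).
        self.rgb_to_dense = [ZeroLink() for _ in pretrained]
        self.dense_to_rgb = [ZeroLink() for _ in pretrained]

    def trainable_modules(self):
        """Modules that fine-tuning would update in this sketch."""
        modules = self.rgb + self.dense + self.rgb_to_dense + self.dense_to_rgb
        return [m for m in modules if m.trainable]

    def forward(self, x_rgb, x_dense):
        links = zip(self.rgb, self.dense, self.rgb_to_dense, self.dense_to_rgb)
        for rgb_layer, dense_layer, r2d, d2r in links:
            h_rgb, h_dense = rgb_layer(x_rgb), dense_layer(x_dense)
            # Each branch receives the other branch's features through a link.
            x_rgb = [a + b for a, b in zip(h_rgb, d2r(h_dense))]
            x_dense = [a + b for a, b in zip(h_dense, r2d(h_rgb))]
        return x_rgb, x_dense


if __name__ == "__main__":
    model = JointSketch([Layer(2.0, 1.0), Layer(0.5, -1.0)])
    print(model.forward([1.0, 2.0], [3.0, 4.0]))
    print(len(model.trainable_modules()))
```

In this sketch only the copied branch and the links are reported as trainable, which mirrors the idea of learning the new modality without altering the pre-trained RGB weights. In a real framework the same effect would typically be achieved by disabling gradients on the RGB branch's parameters.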