While 3D content generation has advanced significantly, existing methods still face challenges with input formats, latent space design, and output representations. This paper introduces a novel 3D generation framework that addresses these challenges, offering scalable, high-quality 3D generation with an interactive Point Cloud-structured Latent space. Our framework employs a Variational Autoencoder (VAE) with multi-view posed RGB-D(epth)-N(ormal) renderings as input, using a unique latent space design that preserves 3D shape information, and incorporates a cascaded latent diffusion model for improved shape-texture disentanglement. The proposed method, GaussianAnything, supports multi-modal conditional 3D generation, allowing for point cloud, caption, and single/multi-view image inputs. Notably, the newly proposed latent space naturally enables geometry-texture disentanglement, thus allowing 3D-aware editing. Experimental results demonstrate the effectiveness of our approach on multiple datasets, outperforming existing methods in both text- and image-conditioned 3D generation.