We present Frankenstein, a diffusion-based framework that can generate semantic-compositional 3D scenes in a single pass. Unlike existing methods that output a single, unified 3D shape, Frankenstein simultaneously generates multiple separated shapes, each corresponding to a semantically meaningful part. The 3D scene information is encoded in a single tri-plane tensor, from which multiple Signed Distance Function (SDF) fields can be decoded to represent the compositional shapes. During training, an auto-encoder compresses the tri-planes into a latent space, and a denoising diffusion process is then employed to approximate the distribution of compositional scenes. Frankenstein demonstrates promising results in generating room interiors as well as human avatars with automatically separated parts. The generated scenes facilitate many downstream applications, such as part-wise re-texturing, object rearrangement within a room, or avatar clothing re-targeting.
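To make the compositional decoding concrete, the sketch below shows one plausible way to query several part-level SDFs from a shared tri-plane: features are bilinearly sampled from the three axis-aligned planes at each query point, concatenated, and mapped by a small MLP to one signed distance per semantic part. The class name, channel counts, and the shared-MLP design are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionalSDFDecoder(nn.Module):
    """Hypothetical sketch: decode K part-wise SDF fields from one tri-plane.

    The feature width, MLP size, and shared multi-head output are
    assumptions for illustration, not the paper's exact architecture.
    """
    def __init__(self, feat_dim=32, num_parts=8, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_parts),  # one signed distance per semantic part
        )

    def forward(self, triplane, pts):
        # triplane: (B, 3, C, H, W) feature planes for the xy, xz, yz slices
        # pts:      (B, N, 3) query points in [-1, 1]^3
        B, N, _ = pts.shape
        plane_coords = [pts[..., [0, 1]], pts[..., [0, 2]], pts[..., [1, 2]]]
        feats = []
        for i, uv in enumerate(plane_coords):
            grid = uv.view(B, N, 1, 2)                                       # sampling grid for grid_sample
            f = F.grid_sample(triplane[:, i], grid, align_corners=True)      # (B, C, N, 1)
            feats.append(f.squeeze(-1).transpose(1, 2))                      # (B, N, C)
        x = torch.cat(feats, dim=-1)                                         # (B, N, 3C)
        return self.mlp(x)                                                   # (B, N, num_parts)

# Usage: query part-wise SDF values at random points
decoder = CompositionalSDFDecoder()
triplane = torch.randn(1, 3, 32, 64, 64)
pts = torch.rand(1, 4096, 3) * 2 - 1
sdfs = decoder(triplane, pts)  # (1, 4096, 8): one signed distance per part at each point
```

Each output channel can then be meshed separately (e.g., with marching cubes at the zero level set), which is what makes part-wise editing such as re-texturing or rearrangement possible.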