Text-based generation and editing of 3D scenes hold significant potential for streamlining content creation through intuitive user interactions. While recent advances leverage 3D Gaussian Splatting (3DGS) for high-fidelity and real-time rendering, existing methods are often specialized and task-focused, lacking a unified framework for both generation and editing. In this paper, we introduce SplatFlow, a comprehensive framework that addresses this gap by enabling direct 3DGS generation and editing. SplatFlow comprises two main components: a multi-view rectified flow (RF) model and a Gaussian Splatting Decoder (GSDecoder). The multi-view RF model operates in latent space, generating multi-view images, depths, and camera poses simultaneously, conditioned on text prompts, thus addressing challenges like diverse scene scales and complex camera trajectories in real-world settings. Then, the GSDecoder efficiently translates these latent outputs into 3DGS representations through a feed-forward 3DGS method. Leveraging training-free inversion and inpainting techniques, SplatFlow enables seamless 3DGS editing and supports a broad range of 3D tasks, including object editing, novel view synthesis, and camera pose estimation, within a unified framework without requiring additional complex pipelines. We validate SplatFlow's capabilities on the MVImgNet and DL3DV-7K datasets, demonstrating its versatility and effectiveness in various 3D generation, editing, and inpainting-based tasks.
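The two-stage pipeline described above can be sketched in a minimal, self-contained form. This is a toy illustration only: `velocity` stands in for the learned multi-view RF model (here a simple drift toward a hypothetical text embedding), `sample_latents` shows standard Euler integration of a rectified flow from noise toward a data latent, and `gs_decoder` is a placeholder for the feed-forward GSDecoder that would map latents to per-Gaussian parameters. All function and parameter names are illustrative, not the paper's actual API.

```python
import numpy as np

def velocity(z, t, text_embed):
    # Stand-in for the multi-view RF model's predicted velocity v(z_t, t | text).
    # A real model is a learned network; this toy field just drifts toward
    # the (hypothetical) text-conditioning embedding.
    return text_embed - z

def sample_latents(text_embed, shape, steps=50, seed=0):
    # Rectified-flow sampling: Euler-integrate dz/dt = v(z_t, t) from
    # t = 0 (Gaussian noise) to t = 1 (multi-view latent).
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        z = z + dt * velocity(z, i * dt, text_embed)
    return z

def gs_decoder(latent):
    # Placeholder for the feed-forward GSDecoder: one Gaussian per latent
    # pixel, parameterized by mean xyz, opacity, and rgb color.
    n = latent.reshape(-1, latent.shape[-1]).shape[0]
    return {
        "means": np.zeros((n, 3)),
        "opacity": np.zeros((n, 1)),
        "rgb": np.zeros((n, 3)),
    }
```

In the actual system the latent also carries depth and camera-pose channels, and editing reuses the same RF model via training-free inversion and inpainting rather than a separate pipeline.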