We describe Generative Blocks World, a method for interacting with the scene of a generated image by manipulating simple geometric abstractions. Our method represents a scene as an assembly of convex 3D primitives. The same scene can be represented with different numbers of primitives, so an editor can move either whole structures or small details. Once the scene geometry has been edited, the image is generated by a flow-based method conditioned on depth and a texture hint. Our texture hint takes the modified 3D primitives into account and yields better texture consistency than existing techniques. These texture hints (a) allow accurate object and camera moves and (b) preserve the identity of objects. Our experiments demonstrate that our approach outperforms prior work in visual fidelity, editability, and compositional generalization.
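As a rough illustration of what editing a convex primitive can involve, the sketch below stores a primitive as an intersection of half-spaces and applies a rigid move to it. The abstract does not say how primitives are parameterized, so the half-space form, the `ConvexPrimitive` class, and the helpers `unit_box` and `rotate_z` are all hypothetical names chosen for this example, not the authors' implementation.

```python
import math


def dot(a, b):
    """Dot product of two 3-vectors given as sequences."""
    return sum(x * y for x, y in zip(a, b))


def matvec(M, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [dot(row, v) for row in M]


def rotate_z(theta):
    """Rotation matrix for an angle theta (radians) about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


class ConvexPrimitive:
    """Hypothetical convex primitive: all x with dot(n, x) <= b per face."""

    def __init__(self, halfspaces):
        # Each entry is (normal, offset); assumed form, not from the paper.
        self.halfspaces = [(list(n), float(b)) for n, b in halfspaces]

    def contains(self, p, eps=1e-9):
        """True if point p satisfies every half-space inequality."""
        return all(dot(n, p) <= b + eps for n, b in self.halfspaces)

    def moved(self, R, t):
        """Return the primitive after the rigid move x -> R x + t.

        Substituting x = R^T (x' - t) into dot(n, x) <= b gives
        dot(R n, x') <= b + dot(R n, t).
        """
        out = []
        for n, b in self.halfspaces:
            n_new = matvec(R, n)
            out.append((n_new, b + dot(n_new, t)))
        return ConvexPrimitive(out)


def unit_box(half=0.5):
    """Axis-aligned cube centered at the origin with the given half-width."""
    faces = []
    for axis in range(3):
        for sign in (1.0, -1.0):
            n = [0.0, 0.0, 0.0]
            n[axis] = sign
            faces.append((n, half))
    return ConvexPrimitive(faces)


# Move a box two units along x and turn it 45 degrees about z.
box = unit_box()
edited = box.moved(rotate_z(math.pi / 4), [2.0, 0.0, 0.0])
```

In the method described above, an edited primitive like this would then feed the depth and texture-hint conditioning; that stage is not sketched here.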