We present Text2Room, a method for generating room-scale textured 3D meshes from a text prompt. To this end, we leverage pre-trained 2D text-to-image models to synthesize a sequence of images from different poses. To lift these outputs into a consistent 3D scene representation, we combine monocular depth estimation with a text-conditioned inpainting model. The core idea of our approach is a tailored viewpoint selection such that the content of each image can be fused into a seamless, textured 3D mesh. More specifically, we propose a continuous alignment strategy that iteratively fuses scene frames with the existing geometry. Unlike existing works that focus on generating single objects or zoom-out trajectories from text, our method generates complete 3D scenes with multiple objects and explicit 3D geometry. We evaluate our approach using qualitative and quantitative metrics and show that it is the first method to generate room-scale 3D geometry with compelling textures from text alone.
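To make the lifting and alignment steps more concrete, the following is a minimal sketch rather than the method's actual implementation. It assumes a pinhole camera model and assumes that a predicted depth map is aligned to the depth rendered from the existing geometry by a least-squares scale and shift. The function names `unproject_depth` and `align_depth`, and all parameters, are hypothetical and chosen only for illustration.

```python
import numpy as np


def unproject_depth(depth, K, cam_to_world):
    """Back-project a depth map of shape (H, W) into world-space points (H*W, 3).

    Assumes a pinhole camera with 3x3 intrinsics K and a 4x4 camera-to-world pose.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                 # camera-space rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)           # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]


def align_depth(pred, rendered, known_mask):
    """Fit scale and shift so that pred matches rendered where known_mask is True.

    This is an assumed alignment scheme (ordinary least squares), used here only
    to illustrate fusing a new frame with already existing geometry.
    """
    x = pred[known_mask]
    y = rendered[known_mask]
    A = np.stack([x, np.ones_like(x)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, y, rcond=None)
    return scale * pred + shift
```

In such a sketch, pixels already covered by the mesh supply `known_mask`, the aligned depth fills in the newly inpainted region, and `unproject_depth` turns that region into 3D points that could then be triangulated and merged into the mesh.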