We introduce ShapeWords, an approach for synthesizing images based on 3D shape guidance and text prompts. ShapeWords incorporates target 3D shape information into specialized tokens that are embedded together with the input text, so that 3D shape awareness and textual context jointly guide the image synthesis process. Unlike conventional shape guidance methods, which rely on depth maps restricted to fixed viewpoints and often overlook the full 3D structure or the textual context, ShapeWords generates diverse yet consistent images that reflect both the target shape's geometry and the textual description. Experimental results show that ShapeWords produces images that are more text-compliant and aesthetically plausible while maintaining 3D shape awareness.