Recent advances in text-driven 3D scene editing and stylization, which leverage the powerful capabilities of 2D generative models, have demonstrated promising outcomes. However, challenges remain in ensuring high-quality stylization and view consistency simultaneously. Moreover, applying style consistently to different regions or objects in the scene with semantic correspondence is a challenging task. To address these limitations, we introduce techniques that enhance the quality of 3D stylization while maintaining view consistency and providing optional region-controlled style transfer. Our method achieves stylization by re-training an initial 3D representation using stylized multi-view 2D images of the source views. Therefore, ensuring both style consistency and view consistency of stylized multi-view images is crucial. We achieve this by extending the style-aligned depth-conditioned view generation framework, replacing the fully shared attention mechanism with a single reference-based attention-sharing mechanism, which effectively aligns style across different viewpoints. Additionally, inspired by recent 3D inpainting methods, we utilize a grid of multiple depth maps as a single-image reference to further strengthen view consistency among stylized images. Finally, we propose Multi-Region Importance-Weighted Sliced Wasserstein Distance Loss, allowing styles to be applied to distinct image regions using segmentation masks from off-the-shelf models. We demonstrate that this optional feature enhances the faithfulness of style transfer and enables the mixing of different styles across distinct regions of the scene. Experimental evaluations, both qualitative and quantitative, demonstrate that our pipeline effectively improves the results of text-driven 3D stylization. Project Page: https://haruolabs.github.io/improved-gs-style-page/