Driven by powerful image diffusion models, recent research has achieved the automatic creation of 3D objects from textual or visual guidance. By performing score distillation sampling (SDS) iteratively across different views, these methods succeed in lifting the 2D generative prior into 3D space. However, such a 2D generative image prior bakes the effects of illumination and shadow into the texture. As a result, material maps optimized by SDS inevitably contain spuriously correlated components. The absence of a precise material definition makes it infeasible to relight the generated assets plausibly in novel scenes, which limits their use in downstream applications. In contrast, humans can effortlessly resolve this ambiguity by deducing the material of an object from its appearance and semantics. Motivated by this insight, we propose MaterialSeg3D, a 3D asset material generation framework that infers the underlying material from a 2D semantic prior. Building on this prior model, we devise a mechanism to parse materials in 3D space. We maintain a UV stack, each map of which is unprojected from a specific viewpoint. After traversing all viewpoints, we fuse the stack through a weighted voting scheme and then apply region unification to ensure the coherence of object parts. To fuel the learning of the semantic prior, we collect a material dataset, named Materialized Individual Objects (MIO), which features abundant images, diverse categories, and accurate annotations. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method.
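To make the fusion step concrete, the following is a minimal sketch of weighted voting over a UV stack. It is not the authors' implementation. The abstract does not say how the per-view weights are computed or how unseen texels are handled, so the `UNSEEN` sentinel, the per-texel weight maps, the tie-breaking rule, and the function name `fuse_uv_stack` are all assumptions made for illustration. Region unification is omitted because the abstract gives no details about it.

```python
from collections import defaultdict

UNSEEN = -1  # hypothetical sentinel: texel not visible from this viewpoint


def fuse_uv_stack(uv_stack, weights):
    """Fuse per-view UV material-label maps by weighted voting.

    uv_stack: list of 2D lists of int material labels, one map per viewpoint
              (UNSEEN where the texel was not covered by that view).
    weights:  list of 2D lists of float with the same shape; the per-texel
              confidence of each view (an assumed input, e.g. view quality).
    Returns a single 2D list holding the winning label for every texel.
    """
    height, width = len(uv_stack[0]), len(uv_stack[0][0])
    fused = [[UNSEEN] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            votes = defaultdict(float)
            # Accumulate each view's weight onto the label it proposes.
            for labels, view_weights in zip(uv_stack, weights):
                label = labels[v][u]
                if label != UNSEEN:
                    votes[label] += view_weights[v][u]
            if votes:
                # Highest total weight wins; ties go to the smallest label id.
                fused[v][u] = max(sorted(votes), key=votes.get)
    return fused


# Toy 2x2 UV maps from three viewpoints (labels are arbitrary material ids).
view_a = [[0, 1], [UNSEEN, 2]]
view_b = [[0, 2], [1, 2]]
view_c = [[1, 2], [UNSEEN, UNSEEN]]
weights_a = [[0.9, 0.4], [0.0, 0.5]]
weights_b = [[0.5, 0.3], [0.7, 0.6]]
weights_c = [[0.3, 0.2], [0.0, 0.0]]

fused = fuse_uv_stack([view_a, view_b, view_c],
                      [weights_a, weights_b, weights_c])
print(fused)  # [[0, 2], [1, 2]]
```

In the top-right texel, label 1 receives 0.4 from one view while label 2 receives 0.3 + 0.2 from two views, so the combined vote overrides the single most confident view. A texel that no view covers keeps the `UNSEEN` value in this sketch.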