Monocular normal estimation aims to estimate the normal map from a single RGB image of an object under arbitrary lighting. Existing methods rely on deep models to directly predict normal maps. However, they often suffer from 3D misalignment: while the estimated normal maps may appear visually plausible, the reconstructed surfaces often fail to align with the true geometric details. We argue that this misalignment stems from the current paradigm: the model struggles to distinguish and reconstruct varying geometry represented in normal maps, because differences in the underlying geometry are reflected only through relatively subtle color variations. To address this issue, we propose a new paradigm that reformulates normal estimation as shading sequence estimation, where shading sequences are more sensitive to geometric variation. Building on this paradigm, we present RoSE, a method that leverages image-to-video generative models to predict shading sequences. The predicted shading sequences are then converted into normal maps by solving a simple ordinary least-squares problem. To enhance robustness and better handle complex objects, RoSE is trained on a synthetic dataset, MultiShade, with diverse shapes, materials, and lighting conditions. Experiments demonstrate that RoSE achieves state-of-the-art performance on real-world benchmark datasets for object-based monocular normal estimation.
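The shading-to-normal conversion described above can be illustrated with a minimal sketch. It assumes Lambertian shading under known per-frame directional lights, so each pixel's shading values across the sequence satisfy an overdetermined linear system in its normal; the function name, array shapes, and the assumption of known light directions are illustrative, not part of RoSE's actual specification:

```python
import numpy as np

def normals_from_shading(shadings, lights):
    """Recover per-pixel normals from a shading sequence by ordinary least squares.

    shadings: (T, H, W) array, shading intensity of each pixel under T lights.
    lights:   (T, 3) array, unit light direction for each frame (assumed known).

    Under a Lambertian model, shading s_t = l_t . n, so stacking the T
    observations per pixel gives a linear system L n = s solved by lstsq.
    """
    T, H, W = shadings.shape
    S = shadings.reshape(T, -1)                       # (T, H*W) observations
    N, *_ = np.linalg.lstsq(lights, S, rcond=None)    # (3, H*W) raw normals
    N = N / (np.linalg.norm(N, axis=0, keepdims=True) + 1e-8)  # unit length
    return N.reshape(3, H, W)
```

With at least three non-coplanar light directions the system is full rank and the normals are determined up to the final normalization; in practice, shadowed or clipped pixels would violate the linear model and need masking.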