In real-world applications, it is essential to jointly estimate the 3D pose and class label of objects, i.e., to perform 3D-aware classification. While current approaches for either image classification or pose estimation can be extended to 3D-aware classification, we observe that they are inherently limited: 1) their performance is much lower than that of the respective single-task models, and 2) they are not robust in out-of-distribution (OOD) scenarios. Our main contribution is a novel architecture for 3D-aware classification, which builds upon recent work and performs comparably to single-task models while being highly robust. In our method, an object category is represented as a 3D cuboid mesh with a feature vector at each mesh vertex. Using differentiable rendering, we estimate the 3D object pose by minimizing the reconstruction error between the rendered mesh features and the feature representation of the target image. Object classification is then performed by comparing the reconstruction losses across object categories. Notably, the neural texture of the mesh is trained in a discriminative manner, which enhances classification performance and also helps avoid local optima in the reconstruction loss. Furthermore, we show how our method can be combined with feed-forward neural networks to scale the render-and-compare approach to larger numbers of categories. Our experiments on PASCAL3D+, occluded-PASCAL3D+, and OOD-CV show that our method outperforms all baselines at 3D-aware classification by a wide margin in terms of both performance and robustness.
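The render-and-compare classification rule described in the abstract can be illustrated with a deliberately simplified sketch. It is not the paper's implementation, and everything in it is an assumption made for illustration: the function names (`render`, `reconstruction_loss`, `estimate_pose`, `classify_3d_aware`) are hypothetical, "rendering" is replaced by a circular shift of per-vertex features, the pose is a single discrete azimuth bin, the loss is one minus the mean cosine similarity, and an exhaustive search over poses stands in for gradient-based optimization through a differentiable renderer. Only the overall structure follows the text: fit a pose per category by minimizing a reconstruction loss, then pick the category whose best loss is lowest.

```python
import numpy as np


def render(texture, pose):
    """Toy stand-in for differentiable rendering (assumption, not the real renderer).

    texture: (n_vertices, feat_dim) array of per-vertex feature vectors.
    pose:    integer azimuth bin; "rendering" circularly shifts the vertices.
    """
    return np.roll(texture, shift=pose, axis=0)


def reconstruction_loss(rendered, target):
    """One minus the mean cosine similarity between rendered and target features."""
    num = np.sum(rendered * target, axis=1)
    den = np.linalg.norm(rendered, axis=1) * np.linalg.norm(target, axis=1) + 1e-8
    return 1.0 - float(np.mean(num / den))


def estimate_pose(texture, target):
    """Exhaustive search over pose bins (replaces gradient descent in this toy)."""
    n_poses = texture.shape[0]
    losses = [reconstruction_loss(render(texture, p), target) for p in range(n_poses)]
    best = int(np.argmin(losses))
    return best, losses[best]


def classify_3d_aware(textures, target):
    """Fit a pose per category, then choose the category with the lowest loss."""
    results = {name: estimate_pose(tex, target) for name, tex in textures.items()}
    label = min(results, key=lambda name: results[name][1])
    pose, loss = results[label]
    return label, pose, loss


# Synthetic data: random per-category textures and a noisy "bus" observed at pose 5.
rng = np.random.default_rng(0)
textures = {name: rng.normal(size=(12, 16)) for name in ("car", "bus", "chair")}
target = render(textures["bus"], 5) + 0.05 * rng.normal(size=(12, 16))

label, pose, loss = classify_3d_aware(textures, target)
print(label, pose, round(loss, 4))
```

In this toy setup the random textures are nearly orthogonal across categories, which loosely mirrors the role the abstract assigns to discriminative training: the more distinct the category textures are, the larger the gap between the reconstruction loss of the correct category and that of the others.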