NOVUM: Neural Object Volumes for Robust Object Classification

Discriminative models for object classification typically learn image-based representations that do not capture the compositional and 3D nature of objects. In this work, we show that explicitly integrating 3D compositional object representations into deep networks for image classification substantially improves generalization in out-of-distribution scenarios. In particular, we introduce a novel architecture, referred to as NOVUM, that consists of a feature extractor and a neural object volume for every target object class. Each neural object volume is a composition of 3D Gaussians that emit feature vectors. This compositional object representation allows for highly robust and fast estimation of the object class by independently matching the features of the 3D Gaussians of each category to features extracted from an input image. Additionally, the object pose can be estimated via inverse rendering of the corresponding neural object volume. To enable the classification of objects, the neural features at each 3D Gaussian are trained discriminatively to be distinct from (i) the features of 3D Gaussians in other categories, (ii) the features of other 3D Gaussians of the same object, and (iii) the background features. Our experiments show that NOVUM offers intriguing advantages over standard architectures due to the 3D compositional structure of the object representation, namely (1) exceptional robustness across a spectrum of real-world and synthetic out-of-distribution shifts and (2) enhanced human interpretability compared to standard models, all while maintaining real-time inference and competitive accuracy on in-distribution data.
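The classification step described above, matching the features of each category's 3D Gaussians independently against features extracted from the image, can be illustrated with a minimal sketch. The abstract does not specify the scoring rule, so everything concrete here is an assumption made for illustration: cosine similarity as the matching score, a maximum over image locations per Gaussian, a mean over Gaussians per class, and the hypothetical names `l2_normalize` and `classify_by_feature_matching`. It is not the NOVUM implementation, and it omits the feature extractor, the background features, and pose estimation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def classify_by_feature_matching(image_features, class_gaussian_features):
    """Score each class by matching its Gaussian features to image features.

    image_features: array of shape (H, W, D), e.g. the output of a feature extractor.
    class_gaussian_features: dict mapping class name -> array of shape (K, D),
        one feature vector per 3D Gaussian of that class.

    Assumed scoring rule (not taken from the source): each Gaussian feature is
    matched independently to its most similar image location (cosine
    similarity), and the class score is the mean of these best matches.
    """
    dim = image_features.shape[-1]
    pixels = l2_normalize(image_features.reshape(-1, dim))  # (H*W, D)
    scores = {}
    for name, gaussians in class_gaussian_features.items():
        similarity = l2_normalize(gaussians) @ pixels.T      # (K, H*W)
        scores[name] = float(similarity.max(axis=1).mean())
    best = max(scores, key=scores.get)
    return best, scores


# Toy usage with random features: the "car" Gaussian features are copied into
# the image feature map, so the car class should receive the highest score.
rng = np.random.default_rng(0)
volumes = {
    "car": rng.normal(size=(16, 8)),
    "bus": rng.normal(size=(16, 8)),
}
flat = rng.normal(size=(36, 8))
flat[:16] = volumes["car"]
image = flat.reshape(6, 6, 8)

label, scores = classify_by_feature_matching(image, volumes)
print(label, scores)
```

Because each Gaussian is matched on its own, a class can still score well when some of its Gaussians find no good match, for example under partial occlusion. This is one way to read the robustness argument in the abstract, although the sketch does not reproduce the discriminative training that makes the features distinct.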
