Today, most methods for image understanding tasks rely on feed-forward neural networks. While this approach has enabled empirical accuracy, efficiency, and task adaptation via fine-tuning, it also comes with fundamental disadvantages. Existing networks often struggle to generalize across different datasets, even on the same task. By design, these networks ultimately reason about high-dimensional scene features, which are challenging to analyze, especially when attempting to predict 3D information from 2D images. We propose to recast 3D multi-object tracking from RGB cameras as an \emph{Inverse Rendering (IR)} problem: we optimize via a differentiable rendering pipeline over the latent space of pre-trained 3D object representations and retrieve the latents that best represent object instances in a given input image. To this end, we optimize an image loss over generative latent spaces that inherently disentangle shape and appearance properties. Beyond offering an alternate take on tracking, our method also enables examining the generated objects, reasoning about failure situations, and resolving ambiguous cases. We validate the generalization and scaling capabilities of our method by learning the generative prior exclusively from synthetic data and assessing camera-based 3D tracking on the nuScenes and Waymo datasets. Both datasets are completely unseen by our method, and no fine-tuning is performed on them. Videos and code are available at https://light.princeton.edu/inverse-rendering-tracking/.
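The core idea of optimizing an image loss over object latents through a differentiable renderer can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration and is not the authors' pipeline: the "renderer" draws a single 2D Gaussian blob, the latent vector `(cx, cy, s, a)` stands in for pose, a scalar shape code, and a scalar appearance code, and plain gradient descent with hand-derived gradients replaces whatever optimizer and generative prior the actual method uses.

```python
import numpy as np

def render(latent, size=32):
    """Toy differentiable 'renderer' (hypothetical): one Gaussian blob.

    latent = (cx, cy, s, a): blob center, a scalar 'shape' (width),
    and a scalar 'appearance' (intensity).
    Returns the image and its Jacobian w.r.t. the four latent entries.
    """
    cx, cy, s, a = latent
    v, u = np.mgrid[0:size, 0:size].astype(float)  # row, column pixel grids
    du, dv = u - cx, v - cy
    r2 = du**2 + dv**2
    g = np.exp(-r2 / (2.0 * s**2))
    image = a * g
    # Analytic partial derivatives of the image w.r.t. cx, cy, s, a.
    jac = np.stack([a * g * du / s**2,
                    a * g * dv / s**2,
                    a * g * r2 / s**3,
                    g])
    return image, jac

def image_loss(latent, target):
    """Half sum of squared pixel residuals and its gradient w.r.t. the latent."""
    image, jac = render(latent, target.shape[0])
    residual = image - target
    loss = 0.5 * np.sum(residual**2)
    grad = np.tensordot(jac, residual, axes=([1, 2], [0, 1]))
    return loss, grad

def fit_latent(target, init, lr=0.005, steps=3000):
    """Inverse rendering by gradient descent: find the latent that explains target."""
    latent = np.asarray(init, dtype=float)
    for _ in range(steps):
        _, grad = image_loss(latent, target)
        latent = latent - lr * grad
    return latent

# Synthetic "observation" rendered from a known latent, then recovered
# starting from a perturbed initial guess.
true_latent = np.array([18.0, 14.0, 4.0, 1.0])
target, _ = render(true_latent)
init_latent = np.array([14.0, 16.0, 5.0, 0.7])
fitted = fit_latent(target, init_latent)
```

Because the fitted latent is explicit, the object it renders can be inspected directly and compared against the observation, which is the kind of examination the abstract refers to. In this toy setting the shape entry `s` and the appearance entry `a` are separate by construction, whereas the method described above relies on a learned generative latent space for that disentanglement.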