Learning about the three-dimensional world from two-dimensional images is a fundamental problem in computer vision. An ideal neural network architecture for such tasks would leverage the fact that objects can be rotated and translated in three dimensions to make predictions about novel images. However, imposing SO(3)-equivariance on two-dimensional inputs is difficult because the group of three-dimensional rotations does not have a natural action on the two-dimensional plane. Specifically, an element of SO(3) may rotate an image out of the plane. We show that an algorithm that learns a three-dimensional representation of the world from two-dimensional images must satisfy certain geometric consistency properties, which we formulate as SO(2)-equivariance constraints. We use the induced and restricted representations of SO(2) on SO(3) to construct and classify architectures that satisfy these geometric consistency constraints. We prove that any architecture that respects these consistency constraints can be realized as an instance of our construction. We show that three previously proposed neural architectures for 3D pose prediction are special cases of our construction. We propose a new algorithm that is a learnable generalization of previously considered methods. We test our architecture on three pose prediction tasks and achieve state-of-the-art results on both the PASCAL3D+ and SYMSOL pose estimation tasks.
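The geometric point about out-of-plane rotations can be illustrated with a minimal NumPy sketch. It is not the construction from the paper; it assumes, purely for illustration, that the image lies in the plane z = 0 and that SO(2) is embedded in SO(3) as rotations about the z-axis. The helper names `rot_z`, `rot_x`, and `stays_in_image_plane` are hypothetical.

```python
import numpy as np

def rot_z(theta):
    """Rotation about the z-axis: an in-plane (SO(2)) rotation under the assumed embedding."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_x(theta):
    """Rotation about the x-axis: a generic element of SO(3) outside the embedded SO(2)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def stays_in_image_plane(R, points, tol=1e-9):
    """Return True if R maps every given point of the plane z = 0 back into that plane."""
    rotated = points @ R.T  # apply R to each row vector
    return bool(np.all(np.abs(rotated[:, 2]) < tol))

# A few sample "pixel" locations, assumed to lie in the image plane z = 0.
pixels = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, 2.0, 0.0]])

# In-plane rotations act on the image; out-of-plane rotations do not.
in_plane = stays_in_image_plane(rot_z(np.pi / 3), pixels)
out_of_plane = stays_in_image_plane(rot_x(np.pi / 3), pixels)

# The embedded SO(2) is closed under composition: angles add.
composes = bool(np.allclose(rot_z(0.4) @ rot_z(0.7), rot_z(1.1)))
```

Under these assumptions, only the rotations about the z-axis map the image plane to itself, which is why the consistency constraints on two-dimensional inputs are phrased in terms of SO(2) rather than the full group SO(3).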