Determining the 3D orientation of an object from a single image, known as single-image pose estimation, is a crucial task in 3D vision applications. Existing methods typically learn 3D rotations parameterized in the spatial domain using Euler angles or quaternions, but these representations often introduce discontinuities and singularities. SO(3)-equivariant networks capture pose patterns in a structured, data-efficient way, but spatial-domain parameterizations are incompatible with their architecture, particularly with spherical CNNs, which operate in the frequency domain for computational efficiency. To overcome these issues, we propose a frequency-domain approach that directly predicts Wigner-D coefficients for 3D rotation regression, aligning with the operations of spherical CNNs. Our SO(3)-equivariant pose harmonics predictor overcomes the limitations of spatial parameterizations, ensuring consistent pose estimation under arbitrary rotations. Trained with a frequency-domain regression loss, our method achieves state-of-the-art results on benchmarks such as ModelNet10-SO(3) and PASCAL3D+, with significant improvements in accuracy, robustness, and data efficiency.
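To make the idea of regressing a rotation in the frequency domain concrete, the following is a minimal sketch and not the implementation of the proposed method. It assumes a truncation to degrees 0 and 1 and the real spherical-harmonic basis, in which the degree-1 Wigner-D block equals the rotation matrix with its axes reordered as (y, z, x), up to convention-dependent signs. All function names (`euler_zyz`, `wigner_d1_real`, `pose_harmonics`, `frequency_mse`) are hypothetical, and the plain mean-squared error stands in for whatever frequency-domain regression loss is actually used.

```python
import math

def rot_z(a):
    """Rotation by angle a (radians) about the z-axis."""
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_y(b):
    """Rotation by angle b (radians) about the y-axis."""
    c, s = math.cos(b), math.sin(b)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matmul(A, B):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def euler_zyz(alpha, beta, gamma):
    """Rotation matrix R = Rz(alpha) @ Ry(beta) @ Rz(gamma)."""
    return matmul(rot_z(alpha), matmul(rot_y(beta), rot_z(gamma)))

# Assumption: real spherical harmonics of degree 1, ordered m = -1, 0, 1,
# are proportional to (y, z, x), so the degree-1 block is R with axes permuted.
PERM = (1, 2, 0)

def wigner_d1_real(R):
    """Degree-1 Wigner-D block in the (assumed) real basis."""
    return [[R[PERM[i]][PERM[j]] for j in range(3)] for i in range(3)]

def pose_harmonics(R):
    """Flattened coefficients: degree 0 (always 1) plus the 9 degree-1 entries."""
    d1 = wigner_d1_real(R)
    return [1.0] + [v for row in d1 for v in row]

def frequency_mse(pred, target):
    """Mean-squared error between two coefficient vectors (illustrative loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Example: loss between a ground-truth pose and a slightly perturbed prediction.
R_true = euler_zyz(0.3, 1.1, -0.7)
R_pred = euler_zyz(0.35, 1.0, -0.6)
loss = frequency_mse(pose_harmonics(R_pred), pose_harmonics(R_true))
print(f"frequency-domain loss: {loss:.6f}")
```

Unlike Euler angles, these coefficients vary smoothly with the rotation and compose by matrix multiplication, which is the property that makes them a natural regression target for networks operating on spherical-harmonic features. Higher degrees would add larger blocks in the same way.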