This paper proposes PoseLecTr, a graph-based encoder-decoder framework that integrates a novel Legendre convolution with attention mechanisms for six-degree-of-freedom (6-DOF) object pose estimation from monocular RGB images. Conventional learning-based approaches predominantly rely on grid-structured convolutions, which can limit their ability to model higher-order and long-range dependencies among image features, especially in cluttered or occluded scenes. PoseLecTr addresses this limitation by constructing a graph representation from image features, where spatial relationships are explicitly modeled through graph connectivity. The proposed framework incorporates a Legendre convolution layer to improve numerical stability in graph convolution, together with spatial-attention and self-attention distillation to enhance feature selection. Experiments conducted on the LINEMOD, Occluded LINEMOD, and YCB-VIDEO datasets demonstrate that our method achieves competitive performance and shows consistent improvements across a wide range of objects and scene complexities.
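To make the role of the Legendre convolution layer concrete, the sketch below shows one plausible, ChebNet-style formulation in which filters are Legendre polynomials of a rescaled normalized graph Laplacian. This is a minimal illustration under stated assumptions, not the authors' implementation; the class name `LegendreGraphConv`, the per-order weight matrices, and the Laplacian rescaling are all assumptions introduced for clarity.

```python
# Minimal sketch (not the authors' code) of a Legendre graph convolution layer.
# Assumption: filters are Legendre polynomials P_k of a rescaled normalized
# Laplacian L_tilde with eigenvalues roughly in [-1, 1], combined with one
# learnable weight matrix per polynomial order, analogous to ChebNet.
import torch
import torch.nn as nn


class LegendreGraphConv(nn.Module):
    """Hypothetical Legendre-polynomial graph convolution.

    y = sum_k P_k(L_tilde) x W_k, where the P_k follow the Legendre recurrence
    (k + 1) P_{k+1}(t) = (2k + 1) t P_k(t) - k P_{k-1}(t).
    """

    def __init__(self, in_dim: int, out_dim: int, order: int = 3):
        super().__init__()
        self.order = order
        # One weight matrix per polynomial order 0..K.
        self.weights = nn.ModuleList(
            [nn.Linear(in_dim, out_dim, bias=False) for _ in range(order + 1)]
        )

    def forward(self, x: torch.Tensor, lap: torch.Tensor) -> torch.Tensor:
        # x: (num_nodes, in_dim); lap: rescaled normalized Laplacian (num_nodes, num_nodes).
        p_prev, p_curr = x, lap @ x  # P_0(L) x and P_1(L) x
        out = self.weights[0](p_prev) + self.weights[1](p_curr)
        for k in range(1, self.order):
            # Three-term Legendre recurrence applied to the propagated features;
            # the bounded polynomial basis is what motivates the stability claim.
            p_next = ((2 * k + 1) * (lap @ p_curr) - k * p_prev) / (k + 1)
            out = out + self.weights[k + 1](p_next)
            p_prev, p_curr = p_curr, p_next
        return out


if __name__ == "__main__":
    n, d_in, d_out = 6, 16, 32
    adj = torch.rand(n, n)
    adj = ((adj + adj.t()) > 1.0).float()                 # random symmetric adjacency
    deg = adj.sum(dim=1).clamp(min=1)
    d_inv_sqrt = torch.diag(deg.pow(-0.5))
    lap = torch.eye(n) - d_inv_sqrt @ adj @ d_inv_sqrt    # normalized Laplacian, eigvals in [0, 2]
    lap_tilde = lap - torch.eye(n)                        # rescale roughly into [-1, 1]
    feats = torch.randn(n, d_in)
    print(LegendreGraphConv(d_in, d_out, order=3)(feats, lap_tilde).shape)
```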