Existing lifting networks for regressing 3D human poses from single-view 2D poses are typically constructed with linear layers based on graph-structured representation learning. In sharp contrast, this paper presents Grid Convolution (GridConv), which mimics the regular convolution operations used in image space. GridConv is based on a novel Semantic Grid Transformation (SGT), which uses a binary assignment matrix to map the irregular graph-structured human pose, joint by joint, onto a regular weave-like grid pose representation, enabling layer-wise feature learning with GridConv operations. We provide two ways to implement SGT: a handcrafted design and a learnable one. Surprisingly, both designs achieve promising results, with the learnable one performing better, which demonstrates the great potential of this new formulation of lifting representation learning. To improve the ability of GridConv to encode contextual cues, we introduce an attention module over the convolutional kernel, making grid convolution operations input-dependent, spatially aware and grid-specific. We show that our fully convolutional grid lifting network outperforms state-of-the-art methods by noticeable margins under (1) conventional evaluation on Human3.6M and (2) cross-evaluation on MPI-INF-3DHP. Code is available at https://github.com/OSVAI/GridConv.
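To make the role of the binary assignment matrix concrete, the following is a minimal NumPy sketch of a handcrafted-style SGT followed by one plain 3x3 convolution over the resulting grid pose. It is an illustration under stated assumptions, not the released implementation: the 17-joint skeleton, the 5x5 grid size, the placeholder layout, and the helper names `make_assignment`, `to_grid` and `conv3x3` are all hypothetical, and the kernel attention module is omitted.

```python
import numpy as np

NUM_JOINTS = 17          # assumption: a 17-joint skeleton
GRID_H, GRID_W = 5, 5    # assumption: illustrative grid size


def make_assignment(layout, num_joints):
    """Build a binary matrix S of shape (H*W, J).

    layout[i, j] holds the index of the joint placed in grid cell (i, j),
    so each row of S is one-hot and a joint may fill several cells.
    """
    cells = np.asarray(layout).reshape(-1)
    S = np.zeros((cells.size, num_joints), dtype=np.float32)
    S[np.arange(cells.size), cells] = 1.0
    return S


def to_grid(pose, S, h, w):
    """Map a graph pose of shape (J, C) to a grid pose of shape (h, w, C)."""
    return (S @ pose).reshape(h, w, -1)


def conv3x3(grid, kernel):
    """Zero-padded 3x3 convolution (cross-correlation, as in deep learning).

    grid: (H, W, C_in), kernel: (3, 3, C_in, C_out) -> output: (H, W, C_out).
    """
    h, w, _ = grid.shape
    padded = np.pad(grid, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, w, kernel.shape[-1]), dtype=grid.dtype)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + 3, j:j + 3, :]
            out[i, j] = np.tensordot(patch, kernel, axes=([0, 1, 2], [0, 1, 2]))
    return out


rng = np.random.default_rng(0)
pose2d = rng.standard_normal((NUM_JOINTS, 2)).astype(np.float32)

# Placeholder layout (hypothetical): joints are repeated cyclically to fill
# the grid. A real handcrafted layout would place related joints side by side.
layout = (np.arange(GRID_H * GRID_W) % NUM_JOINTS).reshape(GRID_H, GRID_W)

S = make_assignment(layout, NUM_JOINTS)
grid_pose = to_grid(pose2d, S, GRID_H, GRID_W)

kernel = rng.standard_normal((3, 3, 2, 8)).astype(np.float32)
features = conv3x3(grid_pose, kernel)
```

In this sketch the layout is fixed by hand; a learnable variant would instead have to produce `S` from trainable parameters, which is not shown here.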