Directly regressing the non-rigid shape and camera pose from an individual 2D frame is ill-suited to the Non-Rigid Structure-from-Motion (NRSfM) problem. Such a frame-by-frame 3D reconstruction pipeline overlooks the inherent spatial-temporal nature of NRSfM, i.e., reconstructing the whole 3D sequence from the input 2D sequence. In this paper, we propose to model deep NRSfM from a sequence-to-sequence translation perspective, where the input 2D frame sequence is taken as a whole to reconstruct the deforming 3D non-rigid shape sequence. First, we apply a shape-motion predictor to estimate the initial non-rigid shape and camera motion from a single frame. Then we propose a context modeling module to model camera motions and complex non-rigid shapes across the sequence. To tackle the difficulty of enforcing a global structure constraint within a deep framework, we propose to impose the union-of-subspaces structure by replacing the self-expressiveness layer with multi-head attention and delayed regularizers, which enables end-to-end batch-wise training. Experimental results on multiple datasets, including Human3.6M, CMU Mocap, and InterHand, demonstrate the superiority of our framework.
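To make the self-expressiveness replacement concrete: in classical subspace clustering, a self-expressiveness layer represents each sample as a linear combination of the others, X ≈ CX, where the coefficient matrix C reveals the union-of-subspaces structure. Multi-head attention computes an analogous data-dependent coefficient matrix via softmax(QKᵀ/√d). The following is a minimal NumPy sketch of this analogy only, not the paper's actual implementation; all shapes, weight initializations, and function names here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_expression(X, Wq, Wk, Wv, n_heads):
    """Illustrative stand-in for a self-expressiveness layer.

    X: (T, d) sequence of T per-frame features.
    Each head builds a (T, T) attention matrix A_h = softmax(Q_h K_h^T / sqrt(d_h)),
    playing the role of the self-expression coefficients C in X ~ C X.
    Returns the re-expressed features, concatenated over heads.
    """
    T, d = X.shape
    dh = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        sl = slice(h * dh, (h + 1) * dh)
        # (T, T) coefficient matrix: each frame expressed by the others.
        A = softmax(Q[:, sl] @ K[:, sl].T / np.sqrt(dh))
        heads.append(A @ V[:, sl])
    return np.concatenate(heads, axis=1)

# Toy usage with hypothetical dimensions: 8 frames, 16-dim features, 4 heads.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 16)) * 0.1 for _ in range(3))
Y = multi_head_self_expression(X, Wq, Wk, Wv, n_heads=4)
```

Unlike a fixed self-expressiveness coefficient matrix (one learned parameter per sample pair, which ties the layer to a fixed training set), the attention coefficients are computed from the input itself, which is what permits end-to-end batch-wise training over arbitrary sequences.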