RGB images and inertial signals have each been used for motion capture (mocap), but combining them is a new and interesting topic. We believe the two modalities are complementary and that their combination can resolve the inherent difficulties of single-modality input: occlusions, extreme lighting/texture, and out-of-view subjects for visual mocap, and global drift for inertial mocap. To this end, we propose a method that fuses monocular images and sparse IMUs for real-time human motion capture. Our method uses a dual coordinate strategy to fully exploit the IMU signals for different goals in motion capture. Specifically, one branch transforms the IMU signals to the camera coordinate system so they can be combined with the image information, while another branch learns from the IMU signals in the body root coordinate system to better estimate body poses. Furthermore, a hidden state feedback mechanism is proposed for both branches to compensate for their respective drawbacks in extreme input cases. Our method can thus easily switch between the two kinds of signals, or combine them, depending on the case, to achieve robust mocap. Quantitative and qualitative results demonstrate that, with a carefully designed fusion method, our technique significantly outperforms the state-of-the-art vision-based, IMU-based, and combined methods on both global orientation and local pose estimation. Our code is available for research at https://shaohua-pan.github.io/robustcap-page/.
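To make the dual coordinate strategy more concrete, the following is a minimal sketch of how the same IMU readings could be expressed in the two coordinate systems. It is an illustration under stated assumptions, not the authors' implementation: the function names, the array layout, the choice of the last sensor as the root IMU, and the availability of a world-to-camera rotation `R_wc` are all hypothetical.

```python
import numpy as np


def rot_z(theta):
    """Rotation matrix about the z-axis (helper used only for the demo)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def to_camera_frame(R_wc, imu_rots, imu_accs):
    """Express IMU signals in the camera coordinate system.

    Assumptions (not from the paper):
      R_wc     : (3, 3) rotation mapping world coordinates to camera coordinates
      imu_rots : (N, 3, 3) sensor orientations in the world frame
      imu_accs : (N, 3) sensor accelerations in the world frame
    """
    rots_cam = np.einsum("ij,njk->nik", R_wc, imu_rots)  # R_wc @ R_n
    accs_cam = imu_accs @ R_wc.T                         # (R_wc @ a_n) per row
    return rots_cam, accs_cam


def to_root_frame(imu_rots, imu_accs, root_index=-1):
    """Express IMU signals relative to the body root sensor.

    The result does not depend on the global heading, because every
    quantity is measured relative to the root IMU (assumed at root_index).
    """
    R_root = imu_rots[root_index]
    rots_root = np.einsum("ji,njk->nik", R_root, imu_rots)  # R_root^T @ R_n
    accs_root = (imu_accs - imu_accs[root_index]) @ R_root  # R_root^T @ (a_n - a_root)
    return rots_root, accs_root


# Demo with three made-up sensors; the last one plays the role of the root.
rots = np.stack([rot_z(0.3), rot_z(1.0), rot_z(-0.5)])
accs = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.5, 0.5, 9.8]])
rots_root, accs_root = to_root_frame(rots, accs)
rots_cam, accs_cam = to_camera_frame(rot_z(0.2), rots, accs)
```

In this sketch the root-frame features stay the same when the whole body is rotated in the world, which suits local pose estimation, while the camera-frame features share a coordinate system with the image and so remain usable for global orientation.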