Capturing the interactions between humans and their environment in 3D is important for many applications in robotics, graphics, and vision. Recent methods that reconstruct the 3D human and object from a single RGB image do not produce consistent relative translation across frames because they assume a fixed depth. Moreover, their performance drops significantly when the object is occluded. In this work, we propose a novel method to track the 3D human, the object, the contacts between them, and their relative translation across frames from a single RGB camera, while being robust to heavy occlusions. Our method is built on two key insights. First, we condition our neural field reconstructions for human and object on per-frame SMPL model estimates obtained by pre-fitting SMPL to a video sequence. This improves neural reconstruction accuracy and produces coherent relative translation across frames. Second, human and object motion from visible frames provides valuable information to infer the occluded object. We propose a novel transformer-based neural network that explicitly uses object visibility and human motion, drawing on neighbouring frames to make predictions for the occluded frames. Building on these insights, our method tracks both human and object robustly, even under occlusions. Experiments on two datasets show that our method significantly improves over state-of-the-art methods. Our code and pretrained models are available at: https://virtualhumans.mpi-inf.mpg.de/VisTracker
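To make the second insight concrete, the sketch below shows a much simpler stand-in for the idea of using visible frames to fill in occluded ones: visibility-gated linear interpolation of the object translation. It is not the transformer-based network described above, and the function name `infill_occluded`, the `threshold` parameter, and the array layout are all assumptions made for illustration only.

```python
import numpy as np


def infill_occluded(translations, visibility, threshold=0.5):
    """Hypothetical baseline, NOT the paper's network.

    Replaces the object translation in low-visibility frames with a
    linear interpolation between the nearest visible frames.

    translations: (T, 3) per-frame object translation estimates
    visibility:   (T,) per-frame object visibility scores in [0, 1]
    threshold:    assumed cut-off below which a frame counts as occluded
    """
    translations = np.asarray(translations, dtype=float)
    visibility = np.asarray(visibility, dtype=float)

    visible = visibility >= threshold
    if not visible.any():
        raise ValueError("no visible frame to interpolate from")

    frames = np.arange(len(translations))
    out = translations.copy()
    # Interpolate each coordinate independently over the frame index.
    # np.interp holds the nearest visible value at the sequence borders.
    for axis in range(translations.shape[1]):
        out[~visible, axis] = np.interp(
            frames[~visible], frames[visible], translations[visible, axis]
        )
    return out


# Frame 1 is heavily occluded, so its unreliable estimate is replaced.
example = infill_occluded(
    [[0.0, 0.0, 2.0], [9.0, 9.0, 9.0], [2.0, 0.0, 2.0]],
    [1.0, 0.1, 1.0],
)
```

In this example the occluded middle frame is replaced by the midpoint of its two visible neighbours. A learned model can additionally exploit human motion during the occluded span, which plain interpolation ignores.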