Quantifying motion in 3D is important for studying the behavior of humans and other animals, but manual pose annotations are expensive and time-consuming to obtain. Self-supervised keypoint discovery is a promising strategy for estimating 3D poses without annotations. However, current keypoint discovery approaches commonly process single 2D views and do not operate in 3D space. We propose a new method to perform self-supervised keypoint discovery in 3D from multi-view videos of behaving agents, without any keypoint or bounding box supervision in 2D or 3D. Our method, BKinD-3D, uses an encoder-decoder architecture with a 3D volumetric heatmap, trained to reconstruct spatiotemporal differences across multiple views and regularized by joint length constraints on a learned 3D skeleton of the subject. In this way, we discover keypoints in videos of humans and rats without manual supervision, demonstrating the potential of 3D keypoint discovery for studying behavior.
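The two ingredients named in the abstract, a 3D volumetric heatmap and joint length constraints, can be illustrated with a minimal NumPy sketch. This is not the BKinD-3D implementation: the function names, the normalized grid range, and the exact form of the penalty (squared deviation of each bone length from its mean over time) are assumptions chosen for illustration. The first function reads a keypoint location out of a score volume with a soft-argmax, and the second scores how much the bone lengths of a skeleton vary across frames.

```python
import numpy as np


def soft_argmax_3d(volume, grid_min=-1.0, grid_max=1.0):
    """Expected (x, y, z) location under a softmax over a (D, H, W) score volume.

    The grid range is a hypothetical normalized coordinate system, not one
    taken from the paper.
    """
    d, h, w = volume.shape
    # Numerically stable softmax over all voxels.
    probs = np.exp(volume - volume.max())
    probs /= probs.sum()
    # Voxel coordinates along each axis.
    zs = np.linspace(grid_min, grid_max, d)
    ys = np.linspace(grid_min, grid_max, h)
    xs = np.linspace(grid_min, grid_max, w)
    # Marginalize the probability volume onto each axis, then take the mean.
    z = (probs.sum(axis=(1, 2)) * zs).sum()
    y = (probs.sum(axis=(0, 2)) * ys).sum()
    x = (probs.sum(axis=(0, 1)) * xs).sum()
    return np.array([x, y, z])


def bone_length_penalty(keypoints, edges):
    """Mean squared deviation of each bone length from its mean over time.

    keypoints: array of shape (T, K, 3), K keypoints over T frames.
    edges: list of (i, j) keypoint index pairs defining the skeleton
           (assumed given here; the abstract describes the skeleton as learned).
    """
    i, j = np.array(edges).T
    # Bone lengths per frame and per edge, shape (T, E).
    lengths = np.linalg.norm(keypoints[:, i] - keypoints[:, j], axis=-1)
    mean_lengths = lengths.mean(axis=0, keepdims=True)
    return float(((lengths - mean_lengths) ** 2).mean())
```

A skeleton that moves rigidly keeps every bone length fixed and so incurs zero penalty under this sketch, while a skeleton whose limbs stretch or shrink between frames is penalized. A soft-argmax is used here because it is differentiable, which is why it is a common way to train heatmap-based keypoint models end to end.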