The recognition of dynamic and social behavior in animals is fundamental to advancing ethology, ecology, medicine, and neuroscience. Recent progress in deep learning has enabled automated behavior recognition from video, yet accurate reconstruction of three-dimensional (3D) pose and shape has not been integrated into this process. For non-human primates in particular, mesh-based tracking efforts lag behind those for other species, leaving pose descriptions restricted to sparse keypoints that cannot fully capture the richness of action dynamics. To address this gap, we introduce the $\textbf{Big Ma}$ca$\textbf{Q}$ue 3D Motion and Animation Dataset ($\texttt{BigMaQ}$), a large-scale dataset comprising more than 750 scenes of interacting rhesus macaques with detailed 3D pose descriptions. Extending previous surface-based animal tracking methods, we construct subject-specific textured avatars by adapting a high-quality macaque template mesh to individual monkeys, which yields pose descriptions more accurate than those of previous state-of-the-art surface-based animal tracking methods. From the full dataset we derive BigMaQ500, an action recognition benchmark that links surface-based pose vectors to single frames across multiple individual monkeys. By comparing features from established image and video encoders with and without our pose descriptors, we demonstrate substantial improvements in mean average precision (mAP) when pose information is included. With these contributions, $\texttt{BigMaQ}$ establishes the first dataset that both integrates dynamic 3D pose-shape representations into the learning task of animal action recognition and provides a rich resource for advancing the study of visual appearance, posture, and social interaction in non-human primates. The code and data are publicly available at https://martinivis.github.io/BigMaQ/.