The understanding of complex human interactions and group activities has garnered attention in human-centric computer vision. However, progress on the related tasks is hindered by the difficulty of obtaining large-scale labeled real-world datasets. To mitigate this issue, we propose M3Act, a multi-view, multi-group, multi-person generator of human atomic action and group activity data. Powered by the Unity engine, M3Act contains simulation-ready 3D scenes and human assets, configurable lighting and camera systems, highly parameterized modular group activities, and a large degree of domain randomization during data generation. The generator can produce large-scale datasets of human activities with multiple viewpoints, multiple modalities (RGB images, 2D poses, 3D motions), and high-quality annotations for individual persons and multi-person groups (2D bounding boxes, instance segmentation masks, individual action and group activity categories). Using M3Act, we perform synthetic data pre-training for 2D skeleton-based group activity recognition and RGB-based multi-person pose tracking. The results indicate that learning from our synthetic datasets substantially improves model performance on real-world datasets, with the highest gains of 5.59% and 7.32% in group and person recognition accuracy, respectively, on CAD2, as well as an improvement of 6.63 in MOTP on HiEve. Pre-training with our synthetic data also leads to faster model convergence on downstream tasks (up to 6.8% faster). Moreover, M3Act opens new research problems for 3D group activity generation. We release M3Act3D, an 87.6-hour 3D motion dataset of human activities with larger group sizes and higher complexity of inter-person interactions than previous multi-person datasets. We define multiple metrics and propose a competitive baseline for this novel task.