The study of complex human interactions and group activities has become a focal point in human-centric computer vision. However, progress in related tasks is often hindered by the difficulty of obtaining large-scale labeled datasets from real-world scenarios. To address this limitation, we introduce M3Act, a synthetic data generator for multi-view, multi-group, multi-person human atomic actions and group activities. Powered by the Unity Engine, M3Act features multiple semantic groups, highly diverse and photorealistic images, and a comprehensive set of annotations, which facilitates the learning of human-centered tasks across single-person, multi-person, and multi-group conditions. We demonstrate the advantages of M3Act across three core experiments. The results suggest that our synthetic dataset can significantly improve the performance of several downstream methods and can replace real-world datasets to reduce cost. Notably, M3Act improves the state-of-the-art MOTRv2 on the DanceTrack dataset, moving it from 10th to 2nd place on the leaderboard. Moreover, M3Act opens a new research direction: controllable 3D group activity generation. We define multiple metrics and propose a competitive baseline for this novel task. Our code and data are available at our project page: http://cjerry1243.github.io/M3Act.