With the development of modern smart cities, human-centric video analysis faces the challenge of analyzing diverse and complex events in real scenes. A complex event involves dense crowds, anomalous individuals, or collective behaviors. However, because existing video datasets are limited in scale and coverage, few human analysis approaches have reported their performance on such complex events. To this end, we present a new large-scale dataset with comprehensive annotations, named Human-in-Events or HiEve (Human-centric video analysis in complex Events), for understanding human motions, poses, and actions in a variety of realistic events, especially crowded and complex ones. It contains a record number of poses (>1M), the largest number of action instances (>56k) under complex events, and one of the largest numbers of long-duration trajectories (with an average trajectory length of >480 frames). Building on these diverse annotations, we present two simple baselines, one for action recognition and one for pose estimation. Both leverage cross-label information during training to enhance feature learning in the corresponding visual task. Experiments show that they can boost the performance of existing action recognition and pose estimation pipelines. More importantly, they demonstrate that the wide-ranging annotations in HiEve can improve various video tasks. Furthermore, we conduct extensive experiments to benchmark recent video analysis approaches together with our baseline methods, demonstrating that HiEve is a challenging dataset for human-centric video analysis. We expect that the dataset will advance the development of cutting-edge techniques in human-centric analysis and the understanding of complex events. The dataset is available at http://humaninevents.org