A comprehensive understanding of human-to-human interactions of interest in video streams, such as queuing, handshaking, fighting and chasing, is of great importance to public security surveillance in areas such as campuses, squares and parks. Conventional human interaction recognition uses choreographed videos as inputs, neglects concurrent interactive groups, and performs detection and recognition in separate stages. In contrast, we introduce a new task named human-to-human interaction detection (HID). HID aims to detect subjects, recognize person-wise actions, and group people according to their interactive relations, all in one model. First, building on the popular AVA dataset created for action detection, we establish a new HID benchmark, termed AVA-Interaction (AVA-I), by adding frame-by-frame annotations of interactive relations. AVA-I consists of 85,254 frames and 86,338 interactive groups, and each image includes up to 4 concurrent interactive groups. Second, we present a novel baseline approach for HID, SaMFormer, which contains a visual feature extractor, a split stage that leverages a Transformer-based model to decode action instances and interactive groups, and a merging stage that reconstructs the relationship between instances and groups. All SaMFormer components are jointly trained in an end-to-end manner. Extensive experiments on AVA-I validate the superiority of SaMFormer over representative methods. The dataset and code will be made public to encourage follow-up studies.
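To make the HID output concrete, the following is a minimal sketch of what a per-frame result might look like (boxes, person-wise actions, and groups) together with a toy merging step. It is not the SaMFormer implementation: the names `Person`, `merge_into_groups`, the `affinity` matrix and the `threshold` value are all hypothetical, and the argmax assignment is a simple stand-in for whatever the merging stage actually learns.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Person:
    """One detected subject in a frame (hypothetical container)."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    actions: List[str]                      # person-wise action labels


def merge_into_groups(affinity: List[List[float]],
                      threshold: float = 0.5) -> List[List[int]]:
    """Toy merge step: assign each person to its highest-scoring group.

    affinity[i][g] is an assumed score that person i belongs to group g.
    A person whose best score is below `threshold` stays ungrouped, and
    groups with fewer than two members are dropped, since an interaction
    needs at least two people.
    """
    if not affinity:
        return []
    num_groups = len(affinity[0])
    members: List[List[int]] = [[] for _ in range(num_groups)]
    for person_idx, scores in enumerate(affinity):
        best = max(range(num_groups), key=lambda g: scores[g])
        if scores[best] >= threshold:
            members[best].append(person_idx)
    return [m for m in members if len(m) >= 2]


# Illustrative frame with four people and two candidate groups.
people = [
    Person((10, 20, 60, 180), ["stand", "handshake"]),
    Person((65, 22, 120, 182), ["stand", "handshake"]),
    Person((200, 30, 250, 190), ["walk"]),
    Person((300, 40, 350, 200), ["sit"]),
]
affinity = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.7],
    [0.3, 0.2],
]
groups = merge_into_groups(affinity)
print(groups)
```

In this toy frame only the first two people end up in a shared group; the third is alone in its candidate group and the fourth falls below the threshold, so both are treated as non-interacting.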