Group Activity Recognition (GAR) remains challenging in computer vision due to the complex nature of multi-agent interactions. This paper introduces LiGAR, a LIDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition. LiGAR leverages LiDAR data as a structural backbone to guide the processing of visual and textual information, enabling robust handling of occlusions and complex spatial arrangements. Our framework incorporates a Multi-Scale LIDAR Transformer, Cross-Modal Guided Attention, and an Adaptive Fusion Module to integrate multi-modal data at different semantic levels effectively. LiGAR's hierarchical architecture captures group activities at various granularities, from individual actions to scene-level dynamics. Extensive experiments on the JRDB-PAR, Volleyball, and NBA datasets demonstrate LiGAR's superior performance, achieving state-of-the-art results with improvements of up to 10.6% in F1-score on JRDB-PAR and 5.9% in Mean Per Class Accuracy on the NBA dataset. Notably, LiGAR maintains high performance even when LiDAR data is unavailable during inference, showcasing its adaptability. Our ablation studies highlight the significant contributions of each component and the effectiveness of our multi-modal, multi-scale approach in advancing the field of group activity recognition.
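The abstract's core idea — LiDAR features acting as a structural backbone that guides attention over visual features — can be illustrated with a minimal cross-attention sketch. This is not the paper's actual implementation (the abstract gives no architectural details); it simply shows one plausible reading of "Cross-Modal Guided Attention", where LiDAR tokens form the queries and visual tokens supply keys and values. All dimensions and weight initializations here are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_guided_attention(lidar_feats, visual_feats, d_k=32, seed=0):
    """Illustrative sketch: LiDAR tokens act as queries attending over
    visual tokens, so LiDAR's spatial structure steers which visual
    features are pooled. Weights are random stand-ins, not learned.

    lidar_feats:  (n_lidar, d_lidar) per-point or per-agent LiDAR features
    visual_feats: (n_visual, d_visual) visual tokens (e.g. per-person crops)
    Returns the fused features (n_lidar, d_k) and the attention map.
    """
    rng = np.random.default_rng(seed)
    d_l = lidar_feats.shape[-1]
    d_v = visual_feats.shape[-1]
    # Hypothetical projection matrices (would be learned in practice).
    W_q = rng.standard_normal((d_l, d_k)) / np.sqrt(d_l)
    W_k = rng.standard_normal((d_v, d_k)) / np.sqrt(d_v)
    W_v = rng.standard_normal((d_v, d_k)) / np.sqrt(d_v)

    Q = lidar_feats @ W_q          # queries come from the LiDAR modality
    K = visual_feats @ W_k
    V = visual_feats @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (n_lidar, n_visual)
    return attn @ V, attn


# Example: 5 LiDAR tokens (64-d) guiding attention over 10 visual tokens (128-d).
lidar = np.random.default_rng(1).standard_normal((5, 64))
visual = np.random.default_rng(2).standard_normal((10, 128))
fused, attn = cross_modal_guided_attention(lidar, visual)
```

Each row of `attn` is a distribution over visual tokens, so every LiDAR-derived query produces a convex combination of visual values — one way the framework could remain robust to occlusions, since geometry (LiDAR) decides where to look in the image features.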