Group Activity Recognition (GAR) remains challenging in computer vision due to the complex nature of multi-agent interactions. This paper introduces LiGAR, a LIDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition. LiGAR leverages LiDAR data as a structural backbone to guide the processing of visual and textual information, enabling robust handling of occlusions and complex spatial arrangements. Our framework incorporates a Multi-Scale LIDAR Transformer, Cross-Modal Guided Attention, and an Adaptive Fusion Module to integrate multi-modal data at different semantic levels effectively. LiGAR's hierarchical architecture captures group activities at various granularities, from individual actions to scene-level dynamics. Extensive experiments on the JRDB-PAR, Volleyball, and NBA datasets demonstrate LiGAR's superior performance, achieving state-of-the-art results with improvements of up to 10.6% in F1-score on JRDB-PAR and 5.9% in Mean Per Class Accuracy on the NBA dataset. Notably, LiGAR maintains high performance even when LiDAR data is unavailable during inference, showcasing its adaptability. Our ablation studies highlight the significant contributions of each component and the effectiveness of our multi-modal, multi-scale approach in advancing the field of group activity recognition.
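The abstract's core idea — LiDAR features acting as a structural backbone that guides attention over visual features — can be illustrated with a minimal cross-attention sketch. This is not the paper's actual implementation (the abstract gives no architectural details); it simply shows one plausible reading of "Cross-Modal Guided Attention", where LiDAR tokens form the queries and visual tokens supply keys and values. All dimensions and weight initializations here are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_guided_attention(lidar_feats, visual_feats, d_k=32, seed=0):
    """Illustrative sketch: LiDAR tokens act as queries attending over
    visual tokens, so LiDAR's spatial structure steers which visual
    features are pooled. Weights are random stand-ins, not learned.

    lidar_feats:  (n_lidar, d_lidar) per-point or per-agent LiDAR features
    visual_feats: (n_visual, d_visual) visual tokens (e.g. per-person crops)
    Returns the fused features (n_lidar, d_k) and the attention map.
    """
    rng = np.random.default_rng(seed)
    d_l = lidar_feats.shape[-1]
    d_v = visual_feats.shape[-1]
    # Hypothetical projection matrices (would be learned in practice).
    W_q = rng.standard_normal((d_l, d_k)) / np.sqrt(d_l)
    W_k = rng.standard_normal((d_v, d_k)) / np.sqrt(d_v)
    W_v = rng.standard_normal((d_v, d_k)) / np.sqrt(d_v)

    Q = lidar_feats @ W_q          # queries come from the LiDAR modality
    K = visual_feats @ W_k
    V = visual_feats @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (n_lidar, n_visual)
    return attn @ V, attn


# Example: 5 LiDAR tokens (64-d) guiding attention over 10 visual tokens (128-d).
lidar = np.random.default_rng(1).standard_normal((5, 64))
visual = np.random.default_rng(2).standard_normal((10, 128))
fused, attn = cross_modal_guided_attention(lidar, visual)
```

Each row of `attn` is a distribution over visual tokens, so every LiDAR-derived query produces a convex combination of visual values — one way the framework could remain robust to occlusions, since geometry (LiDAR) decides where to look in the image features.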