Previous methods for group activity recognition usually rely only on information from the image modality. However, mining image information is approaching saturation, making it difficult to extract richer information. Extracting complementary information from other modalities to supplement image information has therefore become increasingly important. Action labels provide clear text information that expresses the semantics of each action, yet existing methods often overlook it. We therefore propose ActivityCLIP, a plug-and-play method that mines the text information contained in action labels to supplement image information and enhance group activity recognition. ActivityCLIP consists of a text branch and an image branch, where the text branch is plugged into the image branch (an off-the-shelf image-based method). The text branch includes an Image2Text module and a relation modeling module. Specifically, Image2Text is a knowledge transfer module that adapts image information into the text information extracted by CLIP via knowledge distillation. Further, to keep the method convenient, we add only a few trainable parameters on top of the image branch's relation module to model interaction relations in the text branch. To show the method's generality, we replicate three representative methods and extend them with ActivityCLIP, which adds only limited trainable parameters and achieves favorable performance improvements for each method. We also conduct extensive ablation studies and compare our method with state-of-the-art methods to demonstrate the effectiveness of ActivityCLIP.
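To make the Image2Text idea concrete, the following is a minimal sketch of feature-level distillation, not the authors' implementation. It assumes a linear projection as the adapter and a cosine-distance loss, neither of which is specified in the abstract. The label names, feature values, and weights are hypothetical, and the `text_emb` vectors are placeholders standing in for frozen CLIP text features of the action labels.

```python
import math

def linear(x, weight):
    """Project vector x with a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

def distill_loss(student, teacher):
    """Assumed distillation loss: 1 - cosine similarity (0 when aligned)."""
    return 1.0 - cosine(student, teacher)

# Placeholder teacher embeddings; in practice these would come from a
# frozen CLIP text encoder applied to the action labels (assumption).
text_emb = {"spiking": [1.0, 0.0], "blocking": [0.0, 1.0]}

# Hypothetical actor feature produced by the image branch.
image_feature = [0.5, 2.0, 1.0]

# Hypothetical trainable projection mapping image space to text space.
weight = [[1.0, 0.0, 0.5],
          [0.0, 0.1, 0.0]]

student = linear(image_feature, weight)
loss_true = distill_loss(student, text_emb["spiking"])
loss_other = distill_loss(student, text_emb["blocking"])
```

In training, only the projection would be updated to minimize the loss against the embedding of the ground-truth label, so the adapted feature moves toward the text representation while the teacher stays fixed.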