We present a new general learning approach, Prompt Learning for Action Recognition (PLAR), which uses prompt learning to guide the learning process. Our approach predicts the action label by helping models focus on the descriptions or instructions associated with actions in the input videos. Our formulation uses several kinds of prompts, including learnable prompts, auxiliary visual information, and large vision models, to improve recognition performance. In particular, we design a learnable prompt method that dynamically generates prompts from a pool of prompt experts, conditioned on the input. Because the prompts share the task objective, PLAR can optimize prompts that guide the model's predictions while explicitly learning input-invariant (prompt experts pool) and input-specific (data-dependent) prompt knowledge. We evaluate our approach on datasets containing both ground-camera and aerial videos, and scenes with single-agent and multi-agent actions. In practice, we observe a 3.17-10.2% accuracy improvement on the aerial multi-agent dataset Okutama and a 1.0-3.6% improvement on the ground-camera single-agent dataset Something-Something V2. We plan to release our code on the WWW.
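The idea of generating a prompt from a pool of prompt experts can be illustrated with a minimal sketch. The abstract does not specify the mechanism, so everything below is an assumption made for illustration: the class name `PromptExpertPool`, the per-expert keys, and the softmax gating over dot-product scores are hypothetical and are not taken from the PLAR implementation. The pool parameters play the role of input-invariant knowledge, and the gating weights computed from each input feature play the role of input-specific knowledge.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class PromptExpertPool:
    """Hypothetical pool of prompt experts shared across all inputs."""

    def __init__(self, num_experts, dim, seed=0):
        rng = random.Random(seed)
        # Input-invariant parameters: one key and one prompt vector per expert.
        # Randomly initialized here; in training they would be learned.
        self.keys = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                     for _ in range(num_experts)]
        self.prompts = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                        for _ in range(num_experts)]

    def generate(self, feature):
        """Return (prompt, weights) for one input feature vector."""
        # Input-specific part: score every expert against the input feature
        # (assumed dot-product gating), then normalize the scores.
        scores = [sum(k * f for k, f in zip(key, feature)) for key in self.keys]
        weights = softmax(scores)
        # The generated prompt is the weighted sum of the expert prompts.
        prompt = [sum(w * p[d] for w, p in zip(weights, self.prompts))
                  for d in range(len(feature))]
        return prompt, weights


pool = PromptExpertPool(num_experts=4, dim=8)
video_feature = [0.1 * i for i in range(8)]  # stand-in for a video encoder output
prompt, weights = pool.generate(video_feature)
```

In a trained system the keys and expert prompts would be updated by gradient descent on the same loss as the action classifier, which is one way to read "sharing the same objective with the task"; the sketch only shows the forward pass.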