We study parameter-efficient image-to-video probing for the unaddressed challenge of recognizing nearly symmetric actions: visually similar actions that unfold in opposite temporal order (e.g., opening vs. closing a bottle). Existing probing mechanisms for image-pretrained models, such as DinoV2 and CLIP, rely on attention mechanisms for temporal modeling but are inherently permutation-invariant, leading to identical predictions regardless of frame order. To address this, we introduce Self-attentive Temporal Embedding Probing (STEP), a simple yet effective approach designed to enforce temporal sensitivity in parameter-efficient image-to-video transfer. STEP enhances self-attentive probing with three key modifications: (1) a learnable frame-wise positional encoding that explicitly encodes temporal order; (2) a single global CLS token for sequence coherence; and (3) a simplified attention mechanism that improves parameter efficiency. STEP outperforms existing image-to-video probing mechanisms by 3-15% across four activity recognition benchmarks while using only 1/3 of the learnable parameters. On two datasets, it surpasses all published methods, including fully fine-tuned models. STEP shows a distinct advantage in recognizing nearly symmetric actions, surpassing other probing mechanisms by 9-19% and parameter-heavier PEFT-based transfer methods by 5-15%. Code and models will be made publicly available.
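To make the three modifications concrete, the following is a minimal PyTorch sketch of a STEP-style probing head over frozen per-frame image features. All names, dimensions, and the single-block attention design here are illustrative assumptions, not the authors' released implementation; in particular, the exact form of the "simplified" attention may differ from the one-layer version shown.

```python
# Minimal sketch of a STEP-style probing head (illustrative, not the
# authors' code): a small temporal head on top of frozen image features.
import torch
import torch.nn as nn

class STEPProbe(nn.Module):
    def __init__(self, dim=768, num_frames=8, num_classes=100, heads=1):
        super().__init__()
        # (1) learnable frame-wise positional encoding: injecting a distinct
        # vector per frame index breaks the permutation invariance of attention.
        self.pos = nn.Parameter(torch.randn(1, num_frames, dim) * 0.02)
        # (2) a single global CLS token that summarizes the whole sequence.
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        # (3) a single lightweight self-attention layer (assumed stand-in for
        # the paper's simplified attention mechanism).
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frame_feats):  # frame_feats: (B, T, D) frozen features
        x = frame_feats + self.pos                       # encode temporal order
        cls = self.cls.expand(x.size(0), -1, -1)         # prepend global CLS
        x = torch.cat([cls, x], dim=1)
        x, _ = self.attn(x, x, x)                        # attend over CLS + frames
        return self.head(x[:, 0])                        # classify from CLS token

probe = STEPProbe()
feats = torch.randn(2, 8, 768)                           # 2 clips, 8 frames each
logits = probe(feats)                                    # shape (2, 100)
# Reversing the frame order changes the logits, because the positional
# encoding ties each feature to its temporal index:
logits_rev = probe(torch.flip(feats, dims=[1]))
```

Only `pos`, `cls`, `attn`, and `head` are trained; the image backbone stays frozen, which is what keeps the transfer parameter-efficient.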