We study parameter-efficient image-to-video probing for the unaddressed challenge of recognizing nearly symmetric actions: visually similar actions that unfold in opposite temporal order (e.g., opening vs. closing a bottle). Existing probing mechanisms for image-pretrained models, such as DinoV2 and CLIP, rely on attention mechanisms for temporal modeling but are inherently permutation-invariant, leading to identical predictions regardless of frame order. To address this, we introduce Self-attentive Temporal Embedding Probing (STEP), a simple yet effective approach designed to enforce temporal sensitivity in parameter-efficient image-to-video transfer. STEP enhances self-attentive probing with three key modifications: (1) a learnable frame-wise positional encoding that explicitly encodes temporal order; (2) a single global CLS token for sequence coherence; and (3) a simplified attention mechanism that improves parameter efficiency. STEP outperforms existing image-to-video probing mechanisms by 3-15% across four activity recognition benchmarks while using only 1/3 of the learnable parameters. On two datasets, it surpasses all published methods, including fully fine-tuned models. STEP shows a distinct advantage in recognizing nearly symmetric actions, surpassing other probing mechanisms by 9-19% and parameter-heavier PEFT-based transfer methods by 5-15%. Code and models will be made publicly available.
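To make the three modifications concrete, the following is a minimal PyTorch sketch of a STEP-style probing head over frozen per-frame image features. All names, dimensions, and the single-block attention design here are illustrative assumptions, not the authors' released implementation; in particular, the exact form of the "simplified" attention may differ from the one-layer version shown.

```python
# Minimal sketch of a STEP-style probing head (illustrative, not the
# authors' code): a small temporal head on top of frozen image features.
import torch
import torch.nn as nn

class STEPProbe(nn.Module):
    def __init__(self, dim=768, num_frames=8, num_classes=100, heads=1):
        super().__init__()
        # (1) learnable frame-wise positional encoding: injecting a distinct
        # vector per frame index breaks the permutation invariance of attention.
        self.pos = nn.Parameter(torch.randn(1, num_frames, dim) * 0.02)
        # (2) a single global CLS token that summarizes the whole sequence.
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        # (3) a single lightweight self-attention layer (assumed stand-in for
        # the paper's simplified attention mechanism).
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frame_feats):  # frame_feats: (B, T, D) frozen features
        x = frame_feats + self.pos                       # encode temporal order
        cls = self.cls.expand(x.size(0), -1, -1)         # prepend global CLS
        x = torch.cat([cls, x], dim=1)
        x, _ = self.attn(x, x, x)                        # attend over CLS + frames
        return self.head(x[:, 0])                        # classify from CLS token

probe = STEPProbe()
feats = torch.randn(2, 8, 768)                           # 2 clips, 8 frames each
logits = probe(feats)                                    # shape (2, 100)
# Reversing the frame order changes the logits, because the positional
# encoding ties each feature to its temporal index:
logits_rev = probe(torch.flip(feats, dims=[1]))
```

Only `pos`, `cls`, `attn`, and `head` are trained; the image backbone stays frozen, which is what keeps the transfer parameter-efficient.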