Temporal Action Localization (TAL) involves localizing and classifying action snippets in an untrimmed video. The emergence of large video foundation models has enabled RGB-only video backbones to outperform previous methods that required both RGB and optical flow modalities. Leveraging these large models is often limited to training only the TAL head, because the GPU memory required to adapt the video backbone for TAL is prohibitively large. To overcome this limitation, we introduce LoSA, the first memory-and-parameter-efficient backbone adapter designed specifically for TAL on untrimmed videos. LoSA specializes for TAL by introducing Long-Short-range Adapters that adapt the intermediate layers of the video backbone over different temporal ranges. These adapters run in parallel to the video backbone, significantly reducing the memory footprint. LoSA also includes Long-Short-range Gated Fusion, which strategically combines the outputs of these adapters across the video backbone layers to enhance the video features provided to the TAL head. Experiments show that LoSA significantly outperforms all existing methods on the standard TAL benchmarks THUMOS-14 and ActivityNet-v1.3 by scaling end-to-end backbone adaptation to billion-parameter-plus models like VideoMAEv2 (ViT-g), leveraging them beyond head-only transfer learning.
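To make the core idea concrete, below is a minimal NumPy sketch of the adapt-and-fuse pattern the abstract describes: lightweight adapters process frozen intermediate backbone features at a short and a long temporal range, and a gate fuses their outputs back into the feature stream. All names, sizes, and the specific pooling/linear-adapter form are illustrative assumptions, not the actual LoSA architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 64, 32  # toy sizes: number of snippets, feature dimension
# Intermediate-layer features from a (frozen) video backbone -- stand-in data.
feats = rng.standard_normal((T, D))

def temporal_context(x, window):
    """Average-pool along time over `window` snippets, then repeat back
    to the original length, giving each snippet context at that range."""
    n = x.shape[0] // window
    pooled = x[: n * window].reshape(n, window, -1).mean(axis=1)
    return np.repeat(pooled, window, axis=0)[: x.shape[0]]

# Hypothetical lightweight adapters: one small linear map per temporal range.
# These are the only weights that would be trained; the backbone stays frozen.
W_short = rng.standard_normal((D, D)) * 0.01
W_long = rng.standard_normal((D, D)) * 0.01

short_out = temporal_context(feats, 2) @ W_short    # short-range adapter
long_out = temporal_context(feats, 16) @ W_long     # long-range adapter

# Gated fusion: a per-channel sigmoid gate (randomly initialized here)
# weighs the two ranges before adding them back to the trunk features.
gate = 1.0 / (1.0 + np.exp(-rng.standard_normal(D)))
fused = feats + gate * short_out + (1.0 - gate) * long_out
print(fused.shape)  # (64, 32)
```

Because the adapters read backbone activations rather than sitting inside the backbone's forward pass, no gradients flow through the large frozen model, which is what keeps the memory footprint small.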