Temporal action localization (TAL) requires long-form reasoning to predict actions of various durations and complex content. Given limited GPU memory, training TAL end to end (i.e., from videos to predictions) on long videos is a significant challenge. Most methods can only train on pre-extracted features without optimizing them for the localization problem, which limits localization performance. In this work, to extend the potential of TAL networks, we propose a novel end-to-end method, Re2TAL, which rewires pretrained video backbones for reversible TAL. Re2TAL builds a backbone with reversible modules, where the input can be recovered from the output, so that the bulky intermediate activations can be cleared from memory during training. Instead of designing a single type of reversible module, we propose a network rewiring mechanism that transforms any module with a residual connection into a reversible module without changing any parameters. This provides two benefits: (1) a large variety of reversible networks can be easily obtained from existing and even future model designs, and (2) the reversible models require much less training effort because they reuse the pre-trained parameters of their original non-reversible versions. Using only the RGB modality, Re2TAL reaches 37.01% average mAP on ActivityNet-v1.3, a new state-of-the-art record, and 64.9% mAP at tIoU=0.5 on THUMOS-14, outperforming all other RGB-only methods.
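To make the idea of a reversible module concrete, the following is a minimal sketch of a generic additive-coupling reversible block (in the style of RevNet), written in plain Python. It is an illustration under stated assumptions, not necessarily the exact rewiring used in Re2TAL: the branch functions `f` and `g`, the two-stream split into `x1` and `x2`, and all helper names are hypothetical. The point it shows is that the input can be recomputed exactly from the output, so intermediate activations need not be stored during training.

```python
import math

# Hypothetical residual branches. In a real backbone these would be the
# learned sub-networks of a residual block; any deterministic function works.
def f(v):
    return [math.tanh(2.0 * a) for a in v]

def g(v):
    return [0.5 * a * a for a in v]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def forward(x1, x2):
    # Additive coupling: each stream is updated by a residual computed
    # from the other stream.
    y1 = add(x1, f(x2))
    y2 = add(x2, g(y1))
    return y1, y2

def inverse(y1, y2):
    # Undo the two updates in reverse order. No stored activations are
    # needed, only the block output and the same branch functions.
    x2 = sub(y2, g(y1))
    x1 = sub(y1, f(x2))
    return x1, x2

# Round trip: recover the input from the output alone.
x1, x2 = [0.1, -0.4, 2.0], [1.5, 0.0, -0.7]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
max_err = max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2))
```

Note that the inverse never inverts `f` or `g` themselves; it only re-evaluates them and subtracts, which is why the branches can be arbitrary functions. The recovery is exact up to floating-point rounding, so `max_err` is expected to be negligibly small.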