Temporal action detection aims to predict the time intervals and classes of action instances in a video. Despite their promising performance, existing two-stream models suffer from slow inference because they rely on computationally expensive optical flow. In this paper, we introduce a decomposed cross-modal distillation framework that builds a strong RGB-based detector by transferring knowledge from the motion modality. Specifically, instead of distilling directly, we propose to learn RGB and motion representations separately and then combine them to perform action localization. The dual-branch design and the asymmetric training objectives enable effective motion knowledge transfer while keeping RGB information intact. In addition, we introduce a local attentive fusion to better exploit the multimodal complementarity. It is designed to preserve the local discriminability of the features, which is important for action localization. Extensive experiments on benchmark datasets verify the effectiveness of the proposed method in enhancing RGB-based action detectors. Notably, our framework is agnostic to backbones and detection heads, bringing consistent gains across different model combinations.
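To make the two main ingredients more concrete, the following is a minimal sketch in plain Python, not the implementation used in the paper. It rests on several assumptions that the abstract does not state: the fusion is written as dot-product attention in which each RGB feature attends only to motion features within a fixed temporal radius, the attended motion feature is added to the RGB feature, and the motion branch is supervised by a mean squared error against teacher features computed from optical flow. The names `local_attentive_fusion`, `motion_distillation_loss`, and `radius` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def local_attentive_fusion(rgb, motion, radius=1):
    """Fuse two T x D feature sequences with temporally local attention.

    Assumed formulation (illustrative only): the RGB feature at step t is
    the query, the motion features within +/- radius steps are the keys and
    values, and the attended motion feature is added to the RGB feature.
    """
    assert len(rgb) == len(motion), "sequences must have the same length"
    steps, dim = len(rgb), len(rgb[0])
    fused = []
    for t in range(steps):
        # Restrict attention to a local window so that each fused feature
        # depends only on its temporal neighbourhood.
        lo, hi = max(0, t - radius), min(steps, t + radius + 1)
        window = motion[lo:hi]
        weights = softmax([dot(rgb[t], m) / math.sqrt(dim) for m in window])
        attended = [
            sum(w * m[d] for w, m in zip(weights, window)) for d in range(dim)
        ]
        fused.append([r + a for r, a in zip(rgb[t], attended)])
    return fused


def motion_distillation_loss(student_motion, teacher_flow):
    """Mean squared error between motion-branch and teacher features.

    Assumed asymmetry (illustrative only): this loss is applied to the
    motion branch alone, so the RGB branch is not pulled toward the
    optical-flow teacher.
    """
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_motion, teacher_flow):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count
```

Under these assumptions, calling `local_attentive_fusion(rgb, motion, radius=0)` reduces to an element-wise sum of the two sequences, while a larger `radius` lets each step draw on neighbouring motion features. In this sketch the teacher features are needed only by `motion_distillation_loss`, which mirrors the stated goal of avoiding optical flow at inference time.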