We present COMEDIAN, a novel pipeline for initializing spatiotemporal transformers for action spotting, a timestamp-level temporal action detection task. The pipeline combines self-supervised learning and knowledge distillation and consists of three steps, the first two of which are initialization stages. First, we initialize a spatial transformer with self-supervised learning, using short videos as input. Second, we initialize a temporal transformer, which enhances the spatial transformer's outputs with global context, through knowledge distillation from a pre-computed feature bank aligned with each short video segment. Finally, we fine-tune both transformers on the action spotting task. Experiments on the SoccerNet-v2 dataset demonstrate state-of-the-art performance and validate the effectiveness of COMEDIAN's pretraining paradigm. Our results highlight several advantages of the pretraining pipeline, including improved performance and faster convergence compared to non-pretrained models.
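The second initialization stage, distillation from a feature bank aligned with each short video segment, can be illustrated with a minimal sketch. The abstract does not specify the distillation objective, so the mean-squared-error loss below is an assumption made only for illustration, and the names `distillation_loss`, `student`, and `teacher_bank` are hypothetical rather than taken from COMEDIAN.

```python
from typing import Sequence

Vector = Sequence[float]


def distillation_loss(student: Sequence[Vector], teacher_bank: Sequence[Vector]) -> float:
    """Mean squared error between student outputs and aligned teacher features.

    student[i]      -- hypothetical temporal-transformer output for short clip i
    teacher_bank[i] -- pre-computed feature aligned with the same clip
    MSE is an assumed objective; the source does not name the loss it uses.
    """
    if len(student) != len(teacher_bank):
        raise ValueError("student outputs and feature bank must align one-to-one")

    total = 0.0
    count = 0
    for s_vec, t_vec in zip(student, teacher_bank):
        if len(s_vec) != len(t_vec):
            raise ValueError("feature dimensions must match")
        # Accumulate squared differences over every feature dimension.
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1

    if count == 0:
        raise ValueError("need at least one non-empty feature vector")
    return total / count


# Two clips with two-dimensional features each.
student = [[0.0, 1.0], [2.0, 2.0]]
teacher_bank = [[0.0, 0.0], [2.0, 4.0]]
loss = distillation_loss(student, teacher_bank)  # (0 + 1 + 0 + 4) / 4 = 1.25
```

In a real training loop this quantity would be computed on tensors and minimized with respect to the temporal transformer's parameters; the pure-Python version only shows the one-to-one alignment between clips and bank entries.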