Online action detection (OAD) aims to identify ongoing actions from streaming video in real time, without access to future frames. Because these actions manifest at varying granularities, from coarse to fine, projecting an entire set of action frames onto a single latent encoding can discard local information, so action features must be captured across multiple scales. In this paper, we propose a multi-scale action learning transformer (MALT), which includes a novel recurrent decoder for feature fusion that has fewer parameters and can be trained more efficiently. We further propose a hierarchical encoder with multiple encoding branches to capture multi-scale action features: the output of each branch is incrementally fed into the subsequent branch as part of a cross-attention calculation, so the output features transition from coarse to fine as the branches deepen. We also introduce an explicit frame scoring mechanism employing sparse attention, which filters irrelevant frames more efficiently without requiring an additional network. The proposed method achieved state-of-the-art performance on two benchmark datasets (THUMOS'14 and TVSeries), outperforming all existing models used for comparison by 0.2% mAP on THUMOS'14 and 0.1% mcAP on TVSeries.
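To make the coarse-to-fine branching concrete, the following is a minimal, hypothetical PyTorch sketch of a hierarchical encoder in which each branch cross-attends to the output of the preceding branch. All names (Branch, HierarchicalEncoder), the branch count, and the dimensions are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the coarse-to-fine hierarchical encoder described
# in the abstract; module names and hyperparameters are assumptions.
import torch
import torch.nn as nn

class Branch(nn.Module):
    """One encoding branch: self-attention over the frame sequence, then
    cross-attention that queries the preceding (coarser) branch's output."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, prev=None):
        h, _ = self.self_attn(x, x, x)
        x = self.norm1(x + h)
        if prev is not None:
            # Fuse the coarser features from the preceding branch.
            h, _ = self.cross_attn(x, prev, prev)
            x = self.norm2(x + h)
        return x

class HierarchicalEncoder(nn.Module):
    """Stacks branches; each branch consumes the previous branch's output
    via cross-attention, so features transition from coarse to fine."""
    def __init__(self, dim: int = 256, num_branches: int = 3):
        super().__init__()
        self.branches = nn.ModuleList(Branch(dim) for _ in range(num_branches))

    def forward(self, frames):          # frames: (batch, time, dim)
        prev, outputs = None, []
        for branch in self.branches:
            prev = branch(frames, prev)
            outputs.append(prev)        # one feature map per scale
        return outputs

# Usage: 64 streaming frame features of dimension 256.
feats = torch.randn(1, 64, 256)
multi_scale = HierarchicalEncoder()(feats)
print([t.shape for t in multi_scale])  # three (1, 64, 256) scale outputs
```

In this sketch the first branch sees no previous output and acts as the coarse encoder; each later branch refines the representation by attending to its predecessor, which mirrors the incremental cross-attention flow the abstract describes.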