PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points

Traditional temporal action detection (TAD) usually handles untrimmed videos containing a small number of action instances from a single label (e.g., ActivityNet, THUMOS). However, this setting might be unrealistic, as different classes of actions often co-occur in practice. In this paper, we focus on the task of multi-label temporal action detection, which aims to localize all action instances in a multi-label untrimmed video. Multi-label TAD is more challenging, as it requires fine-grained class discrimination within a single video and precise localization of the co-occurring instances. To address this challenge, we extend the sparse query-based detection paradigm from traditional TAD and propose PointTAD, a framework for multi-label TAD. Specifically, PointTAD introduces a small set of learnable query points to represent the important frames of each action instance. This point-based representation provides a flexible mechanism to localize the discriminative frames at the boundaries as well as the important frames inside the action. Moreover, we perform the action decoding process with the Multi-level Interactive Module to capture both point-level and instance-level action semantics. Finally, PointTAD is an end-to-end trainable framework that relies only on RGB input, which makes it easy to deploy. We evaluate our method on two popular benchmarks and introduce the new metric of detection-mAP for multi-label TAD. Our model outperforms all previous methods by a large margin under the detection-mAP metric, and also achieves promising results under the segmentation-mAP metric. Code is available at https://github.com/MCG-NJU/PointTAD.
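To make the point-based representation concrete, the sketch below shows one way a set of query points could be collapsed into a temporal segment and compared against a ground-truth instance. The abstract does not specify how points are converted to segments, so the min/max rule here is an assumption, and the names `points_to_segment` and `temporal_iou` are hypothetical rather than taken from the released code. Temporal IoU is the standard overlap measure for detection-style evaluation and is used here only for illustration.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]


def points_to_segment(points: Sequence[float]) -> Segment:
    """Collapse temporal query points into a (start, end) segment.

    Assumption: the segment spans the earliest and latest point (min/max).
    The abstract does not state the conversion rule actually used.
    """
    if not points:
        raise ValueError("at least one query point is required")
    return min(points), max(points)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two temporal segments."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical query points (in seconds) for one action instance:
# two lie near the boundaries, two inside the action.
query_points = [12.0, 3.5, 8.25, 5.0]
predicted = points_to_segment(query_points)
ground_truth = (4.0, 12.0)
overlap = temporal_iou(predicted, ground_truth)
```

In this sketch only the extreme points determine the segment. The interior points do not affect localization, which is consistent with the abstract's description of them as capturing important frames inside the action.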
