Temporal Action Localization (TAL) has been extensively studied in generic video understanding, but fine-grained sports scenarios such as professional badminton remain underexplored because of their complex and subtle spatio-temporal dynamics. In this paper, we focus on fine-grained TAL in professional badminton videos and introduce a new benchmark dataset, Fine-Badminton, which comprises 31 matches and 29 fine-grained stroke categories, covering 2104 rallies and 27597 annotated actions. To capture the intricate motion patterns in such scenarios, we propose a Decoupling Spatio-Temporal Adapter (DSTA), which models spatio-temporal features within a parameter-efficient framework. Specifically, DSTA decomposes motion representation into three parallel branches that capture temporal dynamics, vertical spatial variations, and horizontal spatial variations, respectively. This design allows the model to better distinguish subtle differences among fine-grained actions. Extensive experiments on both the Fine-Badminton dataset and the ShuttleSet benchmark show that the proposed method achieves state-of-the-art performance while adding only a marginal computational and parameter cost. These results support the effectiveness and efficiency of the proposed approach for fine-grained temporal action localization.
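To make the three-branch decomposition concrete, the following is a minimal NumPy sketch of one way an adapter with parallel temporal, vertical, and horizontal branches could be organized. It is an illustration under stated assumptions, not the DSTA implementation: the abstract does not specify the layers, so the bottleneck projections, the depthwise 1D convolutions along each axis, the summation of branches, the ReLU, the zero-initialized up-projection, and all names (`DecoupledAdapterSketch`, `conv1d_along_axis`, `bottleneck`, `kernel_size`) are hypothetical choices borrowed from common adapter designs.

```python
import numpy as np


def conv1d_along_axis(x, kernel, axis):
    """Depthwise 1D cross-correlation along one axis, with zero padding.

    x: float array of shape (T, H, W, C).
    kernel: float array of shape (K, C) with odd K (one filter per channel).
    """
    k = kernel.shape[0]
    pad = k // 2
    pad_width = [(0, 0)] * x.ndim
    pad_width[axis] = (pad, pad)
    xp = np.pad(x, pad_width)
    out = np.zeros_like(x)
    n = x.shape[axis]
    for i in range(k):
        # Window of the padded input that lines up with kernel tap i.
        window = [slice(None)] * x.ndim
        window[axis] = slice(i, i + n)
        out += xp[tuple(window)] * kernel[i]  # kernel[i] broadcasts over channels
    return out


class DecoupledAdapterSketch:
    """Hypothetical adapter: down-project, three parallel 1D branches, up-project."""

    def __init__(self, dim, bottleneck=8, kernel_size=3, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.normal(0.0, 0.02, (dim, bottleneck))
        # Assumption: zero-initialized up-projection, so the adapter starts
        # as an identity mapping and does not disturb the frozen backbone.
        self.w_up = np.zeros((bottleneck, dim))
        self.k_t = rng.normal(0.0, 0.02, (kernel_size, bottleneck))  # temporal
        self.k_h = rng.normal(0.0, 0.02, (kernel_size, bottleneck))  # vertical
        self.k_w = rng.normal(0.0, 0.02, (kernel_size, bottleneck))  # horizontal

    def num_params(self):
        return sum(p.size for p in (self.w_down, self.w_up,
                                    self.k_t, self.k_h, self.k_w))

    def __call__(self, x):
        # x: (T, H, W, dim) feature volume for one clip.
        z = x @ self.w_down
        mixed = (conv1d_along_axis(z, self.k_t, axis=0)
                 + conv1d_along_axis(z, self.k_h, axis=1)
                 + conv1d_along_axis(z, self.k_w, axis=2))
        return x + np.maximum(mixed, 0.0) @ self.w_up  # residual connection


features = np.random.default_rng(1).normal(size=(4, 6, 5, 16))
adapter = DecoupledAdapterSketch(dim=16)
adapted = adapter(features)
```

In this sketch each branch filters along a single axis only, so the cost of the mixing step grows with `3 * kernel_size * bottleneck` weights rather than with a full 3D kernel, which is one plausible reading of how a decoupled design could keep the added parameter count small.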