Temporal action segmentation (TAS) in untrimmed videos requires dense temporal supervision. Most of the annotation cost, however, is spent identifying action transitions, which is also where segmentation errors concentrate and where small temporal shifts can disproportionately degrade segment-level metrics. We introduce B-ACT, a clip-budgeted active learning framework that explicitly allocates supervision to these error-prone boundary regions. B-ACT operates in a hierarchical two-stage loop: (i) it ranks and queries unlabeled videos using predictive uncertainty, and (ii) within each selected video, it detects candidate transitions from the current model predictions and selects the top-$K$ boundaries via a novel boundary score. The boundary score fuses neighborhood uncertainty, class ambiguity, and temporal prediction dynamics to estimate the importance of each candidate frame. Importantly, our annotation protocol requests labels only at the boundary frames, while training still uses boundary-centered clips so that the model's receptive field can exploit temporal context. Extensive experiments on GTEA, 50Salads, and Breakfast show that boundary-centric supervision is label-efficient and consistently surpasses representative TAS active learning baselines and the prior state of the art under sparse budgets. Gains are largest on datasets where performance, as measured by edit and overlap-based F1 metrics, is highly sensitive to boundary placement.
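The two-stage loop described above can be illustrated with a minimal sketch. The abstract does not give the exact form of the boundary score, so everything below is an assumption made for illustration: videos are ranked by mean normalized entropy, candidate transitions are frames where the argmax label changes, and the score is an equally weighted sum of windowed entropy (neighborhood uncertainty), one minus the top-2 probability margin (class ambiguity), and the total variation between consecutive frame distributions (temporal prediction dynamics). The function names, the `radius` parameter, and the weights are hypothetical and are not taken from B-ACT.

```python
import numpy as np

def frame_entropy(probs):
    """Normalized Shannon entropy per frame; probs has shape (T, C) with C >= 2."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1) / np.log(probs.shape[1])

def rank_videos(video_probs):
    """Stage (i), assumed form: sort video ids by mean frame entropy, highest first."""
    return sorted(video_probs,
                  key=lambda v: frame_entropy(video_probs[v]).mean(),
                  reverse=True)

def candidate_boundaries(probs):
    """Frames t whose predicted label differs from the label at frame t - 1."""
    labels = probs.argmax(axis=1)
    return np.flatnonzero(labels[1:] != labels[:-1]) + 1

def boundary_scores(probs, candidates, radius=2, weights=(1.0, 1.0, 1.0)):
    """Assumed score: weighted sum of three cues, each scaled to [0, 1]."""
    T = probs.shape[0]
    entropy = frame_entropy(probs)
    top2 = np.sort(probs, axis=1)[:, -2:]
    ambiguity = 1.0 - (top2[:, 1] - top2[:, 0])      # small top-2 margin -> ambiguous
    dynamics = np.zeros(T)
    dynamics[1:] = 0.5 * np.abs(probs[1:] - probs[:-1]).sum(axis=1)  # total variation
    w_u, w_a, w_d = weights
    scores = []
    for t in candidates:
        lo, hi = max(0, int(t) - radius), min(T, int(t) + radius + 1)
        scores.append(w_u * entropy[lo:hi].mean()    # neighborhood uncertainty
                      + w_a * ambiguity[t]
                      + w_d * dynamics[t])
    return np.asarray(scores)

def select_top_k(probs, k, radius=2):
    """Stage (ii): return the K highest-scoring boundaries and their clip ranges."""
    candidates = candidate_boundaries(probs)
    if candidates.size == 0:
        return candidates, []
    scores = boundary_scores(probs, candidates, radius)
    keep = np.argsort(-scores, kind="stable")[:k]
    chosen = np.sort(candidates[keep])
    T = probs.shape[0]
    # Half-open [start, end) clips centered on each chosen boundary frame.
    clips = [(max(0, int(t) - radius), min(T, int(t) + radius + 1)) for t in chosen]
    return chosen, clips
```

In this sketch only the frames returned in `chosen` would be sent to the annotator, while the ranges in `clips` mark the boundary-centered context used for training. A learned or differently weighted fusion could replace the equal-weight sum without changing the surrounding loop.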