Temporal action localization aims to identify the boundaries and categories of actions in videos, such as scoring a goal in a football match. Single-frame supervision has emerged as a labor-efficient way to train action localizers because it requires only one annotated frame per action. However, localizers trained this way often perform poorly because precise boundary annotations are missing. To address this issue, we propose a visual analysis method that aligns similar actions and then propagates a few user-provided annotations (e.g., boundaries, category labels) to similar actions via the generated alignments. Our method models the alignment between actions as a heaviest path problem and the annotation propagation as a quadratic optimization problem. Because the automatically generated alignments may not accurately match the associated actions and can therefore produce inaccurate localization results, we develop a storyline visualization that explains the localization results of actions and their alignments. This visualization helps users correct wrong localization results and misalignments. The corrections are then used to improve the localization results of other actions. A quantitative evaluation and a case study demonstrate the effectiveness of our method in improving localization performance.
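To make the heaviest path formulation of alignment more concrete, the following is a minimal sketch and not the method's actual implementation. It assumes each action is represented as a sequence of frames, that a frame-to-frame similarity matrix `sim` is already available (for instance cosine similarities, so that entries can be negative), and that an alignment is a monotone path from the first frame pair to the last. The function name `heaviest_path_alignment` and the three allowed moves are illustrative assumptions.

```python
import numpy as np

def heaviest_path_alignment(sim):
    """Find the monotone path of maximum total weight through `sim`.

    sim[i, j] is an assumed similarity between frame i of one action and
    frame j of another. The path starts at (0, 0), ends at (n-1, m-1), and
    moves by (+1, +1), (+1, 0), or (0, +1). Returns (total_weight, path).
    """
    sim = np.asarray(sim, dtype=float)
    n, m = sim.shape
    score = np.full((n, m), -np.inf)        # best path weight ending at (i, j)
    back = np.zeros((n, m, 2), dtype=int)   # predecessor of (i, j) on that path
    score[0, 0] = sim[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best, prev = -np.inf, (0, 0)
            # The diagonal move is checked first, so it wins ties.
            for pi, pj in ((i - 1, j - 1), (i - 1, j), (i, j - 1)):
                if pi >= 0 and pj >= 0 and score[pi, pj] > best:
                    best, prev = score[pi, pj], (pi, pj)
            score[i, j] = best + sim[i, j]
            back[i, j] = prev
    # Trace the path back from the last frame pair to the first.
    path = [(n - 1, m - 1)]
    while path[-1] != (0, 0):
        i, j = path[-1]
        path.append((int(back[i, j][0]), int(back[i, j][1])))
    path.reverse()
    return float(score[-1, -1]), path

if __name__ == "__main__":
    # Toy similarity matrix: matching frames score 1, all others -1.
    toy = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
    total, path = heaviest_path_alignment(toy)
    print(total, path)
```

In this sketch, a boundary annotated on one action could be mapped to another action by reading off the frame pairs on the returned path. How the propagated annotations are then weighted and reconciled is the role of the quadratic optimization step, which is not sketched here.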