Temporal action segmentation in untrimmed videos has gained increased attention recently. However, annotating action classes and frame-wise boundaries is extremely time-consuming and costly, especially on large-scale datasets. To address this issue, we propose an unsupervised approach for learning action classes from untrimmed video sequences. In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning to preserve the spatial layout and sequential nature of the video features. A two-step clustering pipeline on these embedded feature representations then allows us to enforce temporal consistency within, as well as across, videos. Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes. Our evaluation on three challenging datasets shows the impact of each component and, furthermore, demonstrates state-of-the-art unsupervised action segmentation results.
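To make the cluster-then-decode idea concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes frame features are already embedded (the temporal embedding network is omitted), substitutes a single plain k-means step for the two-step clustering pipeline, and uses a simple monotonic dynamic program as a stand-in decoder. All function names (`relative_time_targets`, `kmeans`, `order_clusters_by_time`, `decode_monotonic`, `segment`) are hypothetical and chosen for illustration only.

```python
import numpy as np

def relative_time_targets(n_frames):
    """Pretext target for relative time prediction: frame index scaled to [0, 1]."""
    return np.arange(n_frames) / max(n_frames - 1, 1)

def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (assumed stand-in for the two-step clustering)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # keep the old center if a cluster is empty
                centers[c] = x[labels == c].mean(axis=0)
    # Recompute labels so they match the final centers.
    dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, dists.argmin(axis=1)

def order_clusters_by_time(labels, rel_time, k):
    """Sort cluster ids by the mean relative time of their frames."""
    mean_t = [rel_time[labels == c].mean() if np.any(labels == c) else np.inf
              for c in range(k)]
    return np.argsort(mean_t)

def decode_monotonic(cost):
    """Assign each frame to one of K ordered states so that the state index
    never decreases over time and the summed cost is minimal.
    cost has shape (T, K), with columns already in temporal order."""
    T, K = cost.shape
    acc = np.full((T, K), np.inf)
    back = np.zeros((T, K), dtype=int)
    acc[0, 0] = cost[0, 0]  # assumption: every video starts in the first state
    for t in range(1, T):
        for s in range(K):
            stay = acc[t - 1, s]
            move = acc[t - 1, s - 1] if s > 0 else np.inf
            back[t, s] = s if stay <= move else s - 1
            acc[t, s] = cost[t, s] + min(stay, move)
    path = np.empty(T, dtype=int)
    path[-1] = int(acc[-1].argmin())
    for t in range(T - 1, 0, -1):  # backtrack from the best final state
        path[t - 1] = back[t, path[t]]
    return path

def segment(features, k, seed=0):
    """Cluster embedded frames, order clusters in time, decode segments."""
    rel_time = relative_time_targets(len(features))
    centers, labels = kmeans(features, k, seed=seed)
    order = order_clusters_by_time(labels, rel_time, k)
    cost = ((features[:, None, :] - centers[order][None, :, :]) ** 2).sum(axis=2)
    return decode_monotonic(cost)
```

Calling `segment(features, k)` on a `(T, D)` array returns one ordered segment index per frame. Because the decoder only allows staying in a state or advancing to the next one, the output is a sequence of contiguous segments, which mirrors the temporal consistency the abstract describes within a single video.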