In this paper, we propose a novel, fully unsupervised framework that learns action representations suited to the action segmentation task from a single input video, without requiring any training data. Our method is a deep metric learning approach built on a shallow network, a triplet loss that operates on similarity distributions, and a novel triplet selection strategy that models temporal and semantic priors to discover actions in the learned representation space. In this setting, the temporal boundaries recovered from the learned action representations are of higher quality than those obtained by existing unsupervised approaches. We evaluate the proposed method on two widely used benchmark datasets for action segmentation, where applying a generic clustering algorithm to the learned representations achieves competitive performance.
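To make the idea of triplet-based metric learning with a temporal prior more concrete, the following is a minimal sketch and not the method described above. It uses the standard triplet margin loss on Euclidean distances rather than a loss on similarity distributions, and the selection rule (positives close in time, negatives far in time) is only an illustrative assumption. The function names and the parameters `margin`, `near`, and `far` are hypothetical.

```python
import numpy as np


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on Euclidean distances.

    This is the generic textbook form, used here as a stand-in; it is
    not the similarity-distribution loss the abstract refers to.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()


def sample_temporal_triplets(num_frames, near=5, far=50,
                             num_triplets=100, seed=0):
    """Hypothetical temporal selection rule (an assumption for illustration).

    Positives lie within `near` frames of the anchor; negatives lie at
    least `far` frames away. Requires num_frames > far so that valid
    negatives exist.
    """
    rng = np.random.default_rng(seed)
    triplets = []
    while len(triplets) < num_triplets:
        a = int(rng.integers(num_frames))
        p = a + int(rng.integers(-near, near + 1))
        n = int(rng.integers(num_frames))
        if 0 <= p < num_frames and p != a and abs(n - a) >= far:
            triplets.append((a, p, n))
    return np.array(triplets)


if __name__ == "__main__":
    # Random vectors stand in for per-frame features of one video.
    features = np.random.default_rng(1).normal(size=(200, 16))
    idx = sample_temporal_triplets(num_frames=200)
    loss = triplet_margin_loss(features[idx[:, 0]],
                               features[idx[:, 1]],
                               features[idx[:, 2]])
    print("triplet loss on random features:", loss)
```

In a full pipeline, a loss of this kind would be minimized over the parameters of an embedding network, and the resulting per-frame embeddings would then be passed to a generic clustering algorithm; both of those steps are omitted here.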