This paper presents an unsupervised transformer-based framework for temporal activity segmentation that leverages both frame-level and segment-level cues, in contrast with previous methods, which often rely on frame-level information only. Our approach begins with a frame-level prediction module, which estimates framewise action classes via a transformer encoder and is trained in an unsupervised manner via temporal optimal transport. To exploit segment-level information, we add a segment-level prediction module and a frame-to-segment alignment module. The former includes a transformer decoder for estimating video transcripts, while the latter matches frame-level features with segment-level features, yielding permutation-aware segmentation results. Moreover, inspired by temporal optimal transport, we introduce simple yet effective pseudo labels for unsupervised training of the above modules. Experiments on four public datasets (50 Salads, YouTube Instructions, Breakfast, and Desktop Assembly) show that our approach achieves comparable or better performance than previous methods in unsupervised activity segmentation.
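To make the idea of pseudo labels from temporal optimal transport more concrete, the following is a minimal sketch, not the formulation used in the paper. It assumes a generic entropy-regularized optimal transport problem solved with Sinkhorn iterations, uniform marginals over frames and actions, and a simple diagonal temporal prior added to a cosine-distance cost. All function names (`temporal_prior`, `sinkhorn`, `ot_pseudo_labels`) and the parameters `rho` and `eps` are hypothetical and chosen for illustration only.

```python
import numpy as np


def temporal_prior(num_frames, num_actions):
    """Penalty that grows with distance from the diagonal of the frame-by-action grid.

    Assumption: actions roughly follow a fixed order, so frame t should prefer
    the action whose normalized index is closest to its normalized time.
    """
    t = (np.arange(num_frames)[:, None] + 0.5) / num_frames
    k = (np.arange(num_actions)[None, :] + 0.5) / num_actions
    return np.abs(t - k)


def sinkhorn(cost, eps=0.1, num_iters=200):
    """Entropy-regularized optimal transport with uniform marginals."""
    n, m = cost.shape
    kernel = np.exp(-cost / eps)
    row_marginal = np.full(n, 1.0 / n)
    col_marginal = np.full(m, 1.0 / m)
    v = np.ones(m)
    for _ in range(num_iters):
        u = row_marginal / (kernel @ v)
        v = col_marginal / (kernel.T @ u)
    # Transport plan: diag(u) @ kernel @ diag(v)
    return u[:, None] * kernel * v[None, :]


def ot_pseudo_labels(frame_feats, prototypes, rho=0.5, eps=0.1):
    """Return one hard pseudo label per frame together with the soft transport plan."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    visual_cost = 1.0 - f @ p.T  # cosine distance, in [0, 2]
    cost = visual_cost + rho * temporal_prior(len(f), len(p))
    plan = sinkhorn(cost, eps=eps)
    return plan.argmax(axis=1), plan


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(40, 16))   # 40 frames, 16-dim features (synthetic)
    protos = rng.normal(size=(5, 16))   # 5 hypothetical action prototypes
    labels, plan = ot_pseudo_labels(feats, protos)
    print(labels)
```

In this sketch, the rows of `plan` can serve as soft targets and `labels` as hard targets for a framewise classifier. Raising `rho` pushes the assignment toward a fixed temporal order, while lowering it lets visual similarity dominate.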