In this paper, we tackle the problem of video alignment: matching the frames of a pair of videos that contain similar actions. The main challenge is that accurate correspondences must be established despite differences in execution and appearance between the two videos. We introduce an unsupervised alignment method that uses both global and local features of the frames. In particular, we extract features for each video frame using three machine vision tools: person detection, pose estimation, and a VGG network. These features are then processed and combined into a multidimensional time series that represents the video. The resulting time series are used to align videos of the same action with a novel variant of dynamic time warping named Diagonalized Dynamic Time Warping (DDTW). The main advantage of our approach is that it requires no training, so it can be applied to any new type of action without collecting training samples for it. For evaluation, we consider the video synchronization and phase classification tasks on the Penn Action dataset. To evaluate video synchronization effectively, we also present a new metric called Enclosed Area Error (EAE). The results show that our method outperforms previous state-of-the-art methods, such as TCC, as well as other self-supervised and weakly supervised methods.
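To make the alignment step concrete, the following is a minimal sketch of classic dynamic time warping applied to two multidimensional time series, one feature vector per frame. It is not the DDTW variant proposed here, whose details the abstract does not give. The function name `dtw_align`, the Euclidean frame distance, and the toy inputs are assumptions chosen for illustration only.

```python
import numpy as np


def dtw_align(a, b):
    """Classic DTW between series a (n, d) and b (m, d).

    Hypothetical helper for illustration; this is plain DTW, not DDTW.
    Returns the accumulated cost and the warping path as (i, j) frame pairs.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)

    # Pairwise Euclidean distance between every frame of a and every frame of b.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    # acc[i, j] holds the cheapest cost of aligning a[:i] with b[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance in both videos
                acc[i - 1, j],      # advance in a only
                acc[i, j - 1],      # advance in b only
            )

    # Backtrack from the last frame pair to the first to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin((acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return float(acc[n, m]), path
```

Calling `dtw_align(features_a, features_b)` on two per-frame feature arrays returns a list of matched frame indices, which is the kind of frame-to-frame correspondence that video synchronization evaluates. The quadratic loop is kept for readability rather than speed.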