Many tasks in video analysis and understanding boil down to frame-level feature learning: encapsulating the relevant visual content of each frame so that subsequent processing becomes simpler. While supervised strategies for this learning task can be envisioned, self-supervised and weakly supervised alternatives are preferred because labeled data is difficult to obtain. This paper introduces LRProp -- a novel weakly supervised representation learning approach, with an emphasis on temporal alignment between pairs of videos of the same action category. The proposed approach uses a transformer encoder to extract frame-level features and runs the dynamic time warping (DTW) algorithm within the training iterations to identify the alignment path between video pairs. Through a process referred to as ``pair-wise position propagation'', the per-location probability distributions of these correspondences are matched with the similarity of the frame-level features via KL-divergence minimization. The proposed algorithm also uses a regularized SoftDTW loss to better tune the learned features. Our representation learning paradigm consistently outperforms the state of the art on temporal alignment tasks, establishing a new performance bar over several downstream video analysis applications.
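The two ingredients named above, a DTW alignment path and a KL-divergence match between path-derived distributions and feature similarities, can be illustrated with a minimal NumPy sketch. It is not the LRProp formulation. The Gaussian smoothing of the path into per-frame target distributions, the softmax over negative squared distances, and the names `path_target`, `kl_alignment_loss`, `sigma`, and `temperature` are all assumptions made for illustration. The transformer encoder and the regularized SoftDTW loss are not covered.

```python
import numpy as np


def pairwise_sq_dist(x, y):
    """Squared Euclidean distance between every frame of x (n, d) and y (m, d)."""
    diff = x[:, None, :] - y[None, :, :]
    return np.sum(diff * diff, axis=-1)


def dtw_path(cost):
    """Classic DTW: return the total cost and the optimal alignment path."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    # Backtrack from the last cell to the first, always taking the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    return acc[n, m], path[::-1]


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def path_target(path, n, m, sigma=1.0):
    """Assumed target: for each frame of x, a Gaussian-smoothed distribution
    over the frames of y, centered on the frames DTW matched it to."""
    target = np.zeros((n, m))
    cols = np.arange(m)
    for i, j in path:
        target[i] += np.exp(-0.5 * ((cols - j) / sigma) ** 2)
    return target / target.sum(axis=1, keepdims=True)


def kl_alignment_loss(x, y, temperature=1.0, sigma=1.0):
    """Mean KL(target || softmax similarity) over the frames of x (illustrative only)."""
    cost = pairwise_sq_dist(x, y)
    _, path = dtw_path(cost)
    p = path_target(path, cost.shape[0], cost.shape[1], sigma)
    q = softmax(-cost / temperature, axis=1)
    eps = 1e-12  # guards the logarithms against exact zeros
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))


# Toy "videos": 1-D frame embeddings where the second video repeats one frame.
x = np.array([[0.0], [1.0], [2.0]])
y = np.array([[0.0], [1.0], [1.0], [2.0]])
total, path = dtw_path(pairwise_sq_dist(x, y))
loss = kl_alignment_loss(x, y)
```

In a training loop the embeddings would come from the encoder and the loss would be computed in an automatic differentiation framework. Here the DTW path is treated as a fixed target, which is one possible design choice and not necessarily the one the paper makes.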