Because manually annotating motion in video is difficult, the current best motion estimation methods are trained on synthetic data, and therefore suffer from a train/test gap. Self-supervised methods hold the promise of training directly on real video, but typically perform worse. These include methods trained with warp error (i.e., color constancy) combined with smoothness terms, and methods that encourage cycle-consistency in the estimates (i.e., tracking backwards should retrace the forward trajectory in reverse). In this work, we take on the challenge of improving state-of-the-art supervised models with self-supervised training. We find that when training is initialized from supervised weights, most existing self-supervision techniques make performance worse rather than better, which suggests that the benefit of seeing the new data is outweighed by the noise in the training signal. Focusing on obtaining a ``clean'' training signal from real-world unlabelled video, we propose to separate label-making and training into two distinct stages. In the first stage, we use the pre-trained model to estimate motion in a video, and then select the subset of motion estimates that we can verify with cycle-consistency. This produces a sparse but accurate pseudo-labelling of the video. In the second stage, we fine-tune the model to reproduce these outputs, while also applying augmentations to the input. We complement this bootstrapping method with simple techniques that densify and re-balance the pseudo-labels, ensuring that we do not merely train on ``easy'' tracks. We show that our method yields reliable gains over fully supervised methods on real videos, for both short-term (flow-based) and long-range (multi-frame) pixel tracking.
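The first-stage selection by cycle-consistency can be illustrated with a forward-backward check on a pair of flow fields. The following is a minimal sketch and not the authors' implementation: the function name `cycle_consistent_mask`, the nearest-neighbour lookup of the backward flow, and the `threshold` value are all assumptions made for illustration, since the abstract does not specify the exact verification criterion.

```python
import math


def cycle_consistent_mask(flow_fw, flow_bw, threshold=0.5):
    """Mark pixels whose forward flow is undone by the backward flow.

    flow_fw[y][x] and flow_bw[y][x] are (dx, dy) displacements in pixels.
    The threshold (in pixels) is a hypothetical choice, not from the paper.
    """
    height, width = len(flow_fw), len(flow_fw[0])
    mask = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow_fw[y][x]
            # Nearest-neighbour landing position in the second frame.
            tx = math.floor(x + dx + 0.5)
            ty = math.floor(y + dy + 0.5)
            if not (0 <= tx < width and 0 <= ty < height):
                continue  # Leaves the frame: cannot be verified, so rejected.
            bx, by = flow_bw[ty][tx]
            # A consistent pair satisfies forward + backward ~= 0.
            mask[y][x] = math.hypot(dx + bx, dy + by) <= threshold
    return mask


# Toy example: every pixel moves one step right; one backward vector is wrong.
flow_fw = [[(1.0, 0.0)] * 3 for _ in range(2)]
flow_bw = [[(-1.0, 0.0)] * 3 for _ in range(2)]
flow_bw[0][2] = (0.0, 0.0)
mask = cycle_consistent_mask(flow_fw, flow_bw)
```

Only the estimates where `mask` is true would be kept as pseudo-labels in this sketch, which is what makes the labelling sparse: pixels that leave the frame or fail the round trip contribute no training signal.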