Temporal DINO: A Self-supervised Video Strategy to Enhance Action Prediction

The emerging field of action prediction plays a vital role in computer vision applications such as autonomous driving, activity analysis, and human-computer interaction. Despite significant advances, accurately predicting future actions remains challenging because of the high dimensionality, complex dynamics, and uncertainty inherent in video data. Traditional supervised approaches require large amounts of labelled data, which are expensive and time-consuming to obtain. This paper introduces a novel self-supervised video strategy for enhancing action prediction, inspired by DINO (self-distillation with no labels). The Temporal-DINO approach employs two models: a 'student' that processes past frames, and a 'teacher' that processes both past and future frames and therefore has a broader temporal context. During training, the teacher guides the student to learn future context while the student observes only past frames. The strategy is evaluated on the ROAD dataset for the downstream action prediction task using 3D-ResNet, Transformer, and LSTM architectures. The experimental results show significant improvements in prediction performance across these architectures, with our method achieving an average enhancement of 9.9% Precision Points (PP), highlighting its effectiveness in enhancing the backbones' ability to capture long-term dependencies. Furthermore, our approach is efficient with respect to the pretraining dataset size and the number of epochs required. It addresses limitations of other approaches by supporting various backbone architectures, handling multiple prediction horizons, reducing reliance on hand-crafted augmentations, and streamlining pretraining into a single stage. These findings highlight the potential of our approach in diverse video-based tasks such as activity recognition, motion planning, and scene understanding.
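
The student/teacher arrangement described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names (`split_clip`, `distillation_loss`, `ema_update`) are hypothetical, and the output centering, temperature sharpening, and momentum (EMA) teacher update are borrowed from the original DINO recipe on the assumption that Temporal-DINO keeps them. The temperature and momentum values are illustrative only.

```python
import math


def split_clip(frames, num_past):
    """Build the two views: the student sees only past frames,
    the teacher sees the full clip (past and future frames)."""
    student_view = frames[:num_past]
    teacher_view = frames
    return student_view, teacher_view


def softmax(logits, temperature):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature):
    """Numerically stable log-softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def distillation_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centered, sharpened teacher distribution
    and the student distribution (DINO-style; assumed, not confirmed,
    to carry over to Temporal-DINO)."""
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = softmax(centered, teacher_temp)
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Move the teacher weights toward the student weights (assumed
    DINO-style momentum update; the teacher receives no gradients)."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


# Toy usage: an 8-frame clip where the first 5 frames are the observed past.
clip = list(range(8))
student_view, teacher_view = split_clip(clip, num_past=5)

# Stand-ins for the projection-head outputs of each network.
teacher_out = [2.0, 0.0, 0.0]
center = [0.0, 0.0, 0.0]
loss_when_agreeing = distillation_loss([2.0, 0.0, 0.0], teacher_out, center)
loss_when_disagreeing = distillation_loss([0.0, 2.0, 0.0], teacher_out, center)
```

In this toy setup the loss is small when the student's output on the past-only view agrees with the teacher's output on the full clip, and large when it does not, which is the signal that pushes the student to encode future context from past frames alone.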
