Predicting future dynamics is crucial for applications like autonomous driving and robotics, where understanding the environment is key. Existing pixel-level methods are computationally expensive and often focus on irrelevant details. To address these challenges, we introduce $\texttt{DINO-Foresight}$, a novel framework that operates in the semantic feature space of pretrained Vision Foundation Models (VFMs). Our approach trains a masked feature transformer in a self-supervised manner to predict the evolution of VFM features over time. By forecasting these features, we can apply off-the-shelf, task-specific heads for various scene understanding tasks. In this framework, VFM features serve as a latent space to which different heads can attach to perform specific tasks on future frames. Extensive experiments show that our framework outperforms existing methods, demonstrating its robustness and scalability. Additionally, we show how intermediate transformer representations in $\texttt{DINO-Foresight}$ improve downstream task performance, offering a promising path for the self-supervised enhancement of VFM features. We provide the implementation code at https://github.com/Sta8is/DINO-Foresight.
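The pipeline described above can be sketched in a few lines. This is a hypothetical illustration only, not the paper's implementation: a simple linear extrapolation stands in for the masked feature transformer, and a random linear projection stands in for an off-the-shelf task head; all shapes and names are assumptions.

```python
import numpy as np

# Toy dimensions (assumed): T past frames, N feature tokens per frame,
# D channels per token, as produced by a frozen VFM encoder.
T, N, D = 4, 16, 8
rng = np.random.default_rng(0)
past = rng.standard_normal((T, N, D))  # stand-in for frozen VFM features

def forecast(feats):
    """Stand-in for the masked feature transformer: predict the next
    frame's features by linearly extrapolating the last two frames."""
    return 2 * feats[-1] - feats[-2]

future = forecast(past)  # predicted VFM features of the unseen frame

# Hypothetical off-the-shelf task head: a frozen linear layer mapping
# each predicted token to class logits (e.g. for segmentation).
W = rng.standard_normal((D, 3))
pred = (future @ W).argmax(axis=-1)  # per-token class for the future frame
print(future.shape, pred.shape)  # (16, 8) (16,)
```

The key design point the sketch mirrors is that prediction happens entirely in the VFM feature space, so any head trained on those features applies unchanged to the forecasted ones.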