This paper proposes a self-supervised model that simultaneously predicts a sequence of future frames from video input using a novel spatio-temporal (ST) attention network. The ST transformer network enforces temporal consistency across future frames while also enforcing spatial consistency across objects in the image at different scales. Prior work on depth prediction did not do this, focusing instead on predicting a single output frame. Like single-image depth inference methods, the proposed model leverages prior scene knowledge such as object shape and texture, while also constraining motion and geometry using a sequence of input images. Apart from the transformer architecture, one of the main contributions with respect to prior work is an objective function that enforces spatio-temporal consistency across a sequence of output frames rather than a single output frame. As will be shown, this results in more accurate and robust depth sequence forecasting. The model outperforms existing baselines for depth forecasting on the KITTI benchmark. Extensive ablation studies were performed to assess the effectiveness of the proposed techniques. One remarkable result is that the model is implicitly capable of forecasting the motion of objects in the scene, without requiring complex models involving multi-object detection, segmentation and tracking.