In this paper, we investigate spatio-temporal video prediction, the task of generating future video frames from historical data streams. Existing approaches typically rely on external information such as semantic maps to enhance prediction, and they often neglect the physical knowledge inherent in the videos themselves. Furthermore, their high computational demands can impede their application to high-resolution videos. To address these limitations, we introduce a novel approach called Physics-assisted Spatio-temporal Network (PastNet) for generating high-quality video predictions. The core of PastNet is a spectral convolution operator in the Fourier domain, which efficiently introduces inductive biases from the underlying physical laws. Additionally, we employ a memory bank with the estimated intrinsic dimensionality to discretize local features while processing complex spatio-temporal signals, thereby reducing computational costs and enabling efficient high-resolution video prediction. Extensive experiments on several widely used datasets demonstrate the effectiveness and efficiency of PastNet compared with state-of-the-art methods, particularly in high-resolution scenarios.
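To make the notion of a spectral convolution in the Fourier domain concrete, the following is a minimal NumPy sketch of the general technique. It is not PastNet's implementation: the function name `spectral_conv2d`, the `modes` truncation parameter, the tensor layout, and the choice to retain only the lowest non-negative frequencies are all assumptions made for illustration.

```python
import numpy as np


def spectral_conv2d(x, weights, modes):
    """Illustrative spectral convolution (hypothetical helper, not from the paper).

    x:       real array of shape (C_in, H, W), one frame of feature maps
    weights: complex array of shape (C_in, C_out, modes, modes)
    modes:   number of low-frequency modes kept along each spatial axis
    """
    _, h, w = x.shape
    c_out = weights.shape[1]

    # Move to the Fourier domain; rfft2 keeps the non-redundant half spectrum.
    x_ft = np.fft.rfft2(x, axes=(-2, -1))  # shape (C_in, H, W // 2 + 1)

    # Multiply the retained low-frequency block by the weights, mixing channels.
    # Simplification: only the non-negative low-frequency corner is kept.
    out_ft = np.zeros((c_out, h, w // 2 + 1), dtype=complex)
    out_ft[:, :modes, :modes] = np.einsum(
        "ikl,iokl->okl", x_ft[:, :modes, :modes], weights
    )

    # Return to the spatial domain at the original resolution.
    return np.fft.irfft2(out_ft, s=(h, w), axes=(-2, -1))


rng = np.random.default_rng(0)
frame = rng.standard_normal((2, 16, 16))
weights = rng.standard_normal((2, 3, 4, 4)) + 1j * rng.standard_normal((2, 3, 4, 4))
out = spectral_conv2d(frame, weights, modes=4)
print(out.shape)  # (3, 16, 16)
```

In this sketch the number of weights depends on `modes` rather than on the spatial resolution, which illustrates one reason spectral operators are attractive for high-resolution inputs.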