Video prediction, the task of forecasting future frames from previous ones, has broad applications such as autonomous driving and weather forecasting. Existing state-of-the-art methods typically focus on extracting either spatial, temporal, or spatiotemporal features from videos. Different feature focuses, resulting from different network architectures, may make the resulting models excel at some video prediction tasks but perform poorly on others. Towards a more generic video prediction solution, we explicitly model these features in a unified encoder-decoder framework and propose a novel SImple Alternating Mixer (SIAM). The novelty of SIAM lies in the design of dimension-alternating mixing (DaMi) blocks, which can model spatial, temporal, and spatiotemporal features by alternating the dimensions of the feature maps. Extensive experimental results demonstrate the superior performance of the proposed SIAM on four benchmark video datasets covering both synthetic and real-world scenarios.
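The core dimension-alternating idea can be sketched as follows. This is a minimal toy illustration of the general principle, not the authors' DaMi implementation; all shapes, weights, and the `mix` helper are assumptions introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mix(x, axis, w):
    """Apply a linear mixing layer along one chosen axis of a feature map.

    The axis to be mixed is moved to the last position, multiplied by a
    weight matrix, and moved back. Alternating which axis is swapped in
    lets the same kind of layer mix temporal or spatial dimensions.
    (Illustrative helper, not from the paper.)
    """
    x = np.moveaxis(x, axis, -1)
    x = x @ w
    return np.moveaxis(x, -1, axis)

# Toy video feature map: (time, channels, height, width) -- assumed layout.
T, C, H, W = 4, 8, 16, 16
feats = rng.standard_normal((T, C, H, W))

w_t = rng.standard_normal((T, T))  # temporal mixing weights
w_h = rng.standard_normal((H, H))  # spatial mixing weights (height)
w_w = rng.standard_normal((W, W))  # spatial mixing weights (width)

# Alternate the mixed dimension: temporal, then the two spatial axes.
out = mix(feats, 0, w_t)
out = mix(out, 2, w_h)
out = mix(out, 3, w_w)
print(out.shape)  # (4, 8, 16, 16) -- shape is preserved throughout
```

Stacking such alternations lets a single block family cover spatial, temporal, and joint spatiotemporal interactions without separate specialized branches, which is the intuition behind the unified design described above.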