Spatiotemporal predictive learning methods generally fall into two categories: recurrent-based approaches, which face challenges in parallelization and performance, and recurrent-free methods, which employ convolutional neural networks (CNNs) as encoder-decoder architectures. These methods benefit from strong inductive biases, but often at the expense of scalability and generalization. This paper proposes PredFormer, a pure transformer-based framework for spatiotemporal predictive learning. Motivated by the Vision Transformer (ViT) design, PredFormer leverages carefully designed Gated Transformer blocks, following a comprehensive analysis of 3D attention mechanisms, including full, factorized, and interleaved spatial-temporal attention. With its recurrent-free, transformer-based design, PredFormer is both simple and efficient, outperforming previous methods by large margins. Extensive experiments on synthetic and real-world datasets demonstrate that PredFormer achieves state-of-the-art performance. On Moving MNIST, PredFormer achieves a 51.3% reduction in MSE relative to SimVP. For TaxiBJ, the model decreases MSE by 33.1% and boosts FPS from 533 to 2364. Additionally, on WeatherBench, it reduces MSE by 11.1% while enhancing FPS from 196 to 404. These gains in both accuracy and efficiency demonstrate PredFormer's potential for real-world applications. The source code will be released at https://github.com/yyyujintang/PredFormer.
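The abstract does not specify PredFormer's exact block internals, but of the three 3D attention variants it names, the factorized spatial-temporal form is easy to illustrate: attend over spatial tokens within each frame, then over time at each spatial location. The sketch below is a minimal single-head numpy illustration of that generic idea; all shapes, function names, and the (T, N, D) layout are my own assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head scaled dot-product self-attention over the
    # second-to-last axis; x has shape (..., tokens, dim).
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)  # (..., tokens, tokens)
    return softmax(scores) @ x

def factorized_st_attention(x):
    # x: (T, N, D) -- T frames, N spatial patch tokens per frame, D channels.
    # Spatial step: tokens within each frame attend to each other.
    x = self_attention(x)           # attention over the N axis, per frame
    # Temporal step: each spatial location attends across frames.
    xt = np.swapaxes(x, 0, 1)       # (N, T, D)
    xt = self_attention(xt)         # attention over the T axis, per location
    return np.swapaxes(xt, 0, 1)    # back to (T, N, D)

x = np.random.randn(4, 16, 8)       # 4 frames, 16 patches, 8 channels
y = factorized_st_attention(x)
print(y.shape)                      # (4, 16, 8)
```

Factorization reduces attention cost from O((T·N)²) for full spatial-temporal attention to O(T·N² + N·T²); the interleaved variant mentioned in the abstract would alternate such spatial and temporal steps across successive blocks.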