Combining CNNs or ViTs with RNNs for spatiotemporal forecasting has yielded unparalleled results in predicting temporal and spatial dynamics. However, modeling extensive global information remains a formidable challenge: CNNs are limited by their narrow receptive fields, and ViTs are burdened by the heavy computational cost of their attention mechanisms. Recent Mamba-based architectures have attracted considerable interest for their long-sequence modeling capabilities, surpassing established vision models in both efficiency and accuracy, which motivates us to develop an architecture tailored for spatiotemporal forecasting. In this paper, we propose the VMRNN cell, a new recurrent unit that integrates the strengths of Vision Mamba blocks with LSTM. We construct a network centered on VMRNN cells to tackle spatiotemporal prediction tasks. Our extensive evaluations show that the proposed approach achieves competitive results on a variety of tasks while maintaining a smaller model size. Our code is available at https://github.com/yyyujintang/VMRNN-PyTorch.