Combining CNNs or ViTs with RNNs for spatiotemporal forecasting has yielded strong results in predicting temporal and spatial dynamics. However, modeling extensive global information remains a formidable challenge: CNNs are limited by their narrow receptive fields, and ViTs are constrained by the intensive computational demands of their attention mechanisms. Recent Mamba-based architectures have attracted attention for their long-sequence modeling capabilities and have surpassed established vision models in both efficiency and accuracy, which motivates us to develop an architecture tailored for spatiotemporal forecasting. In this paper, we propose the VMRNN cell, a new recurrent unit that integrates the strengths of Vision Mamba blocks with LSTM. We construct a network centered on VMRNN cells to tackle spatiotemporal prediction tasks. Our extensive evaluations show that the proposed approach achieves competitive results on a variety of tasks while maintaining a smaller model size. Our code is available at https://github.com/yyyujintang/VMRNN-PyTorch.
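To make the idea of pairing a linear-time sequence block with recurrent gating concrete, the following is a minimal illustrative sketch, not the VMRNN implementation. It assumes a frame flattened into scalar tokens, a fixed-coefficient state-space scan (`ssm_scan`) as a stand-in for a Vision Mamba block, and a single coupled gate in place of the full set of LSTM gates. All function names and the coefficients `a` and `b` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ssm_scan(tokens, a=0.9, b=0.1):
    """Linear-time scan s_k = a * s_{k-1} + b * x_k over a token sequence.

    Hypothetical stand-in for a Vision Mamba block; real blocks use learned,
    input-dependent parameters rather than the fixed a and b assumed here.
    """
    state, out = 0.0, []
    for x in tokens:
        state = a * state + b * x
        out.append(state)
    return out


def recurrent_cell_step(x_t, h_prev, c_prev):
    """One gated recurrent update over a flattened frame (simplified gating)."""
    # Fuse the current frame with the previous hidden state, then scan over tokens.
    mixed = ssm_scan([x + h for x, h in zip(x_t, h_prev)])
    h_t, c_t = [], []
    for m, c in zip(mixed, c_prev):
        gate = sigmoid(m)  # single coupled gate (an assumption of this sketch)
        c_new = gate * c + (1.0 - gate) * math.tanh(m)  # update the cell memory
        c_t.append(c_new)
        h_t.append(gate * math.tanh(c_new))  # gated hidden output
    return h_t, c_t


def rollout(frames):
    """Run the cell over a sequence of frames and return the final hidden state."""
    n = len(frames[0])
    h, c = [0.0] * n, [0.0] * n
    for frame in frames:
        h, c = recurrent_cell_step(frame, h, c)
    return h


if __name__ == "__main__":
    # Two toy "frames" of three tokens each.
    print(rollout([[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]))
```

The cost of each step grows linearly with the number of tokens, in contrast to the quadratic cost of full self-attention, which is the property the abstract attributes to Mamba-based blocks.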