Traditional spatiotemporal models generally rely on task-specific architectures, whose domain-specific design limits their generalizability and scalability across diverse tasks. In this paper, we introduce \textbf{UniSTD}, a unified Transformer-based framework for spatiotemporal modeling, inspired by recent advances in foundation models that follow the two-stage pretraining-then-adaptation paradigm. Specifically, we demonstrate that task-agnostic pretraining on 2D vision and vision-text datasets can build a generalizable foundation for spatiotemporal learning, followed by specialized joint training on spatiotemporal datasets to enhance task-specific adaptability. To improve learning across domains, our framework employs a rank-adaptive mixture-of-experts adaptation that uses fractional interpolation to relax the discrete rank variables so that they can be optimized in continuous space. Additionally, we introduce a temporal module to explicitly incorporate temporal dynamics. We evaluate our approach on a large-scale benchmark covering 10 tasks across 4 disciplines, demonstrating that a unified spatiotemporal model can achieve scalable cross-task learning, supporting up to 10 tasks simultaneously within one model while reducing training costs in multi-domain applications. Code will be available at https://github.com/1hunters/UniSTD.
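A minimal sketch of the continuous relaxation of the rank choice (the notation here is our own and is not taken from the paper): let $\Delta W_k$ denote an expert's low-rank adaptation update truncated to integer rank $k$, and let $r \in [1, r_{\max}]$ be a continuous rank variable. Fractional interpolation blends the updates at the two neighboring integer ranks,
\[
\Delta W(r) \;=\; \bigl(1 - \{r\}\bigr)\,\Delta W_{\lfloor r \rfloor} \;+\; \{r\}\,\Delta W_{\lceil r \rceil},
\qquad \{r\} = r - \lfloor r \rfloor ,
\]
so that at integer values of $r$ the discrete choice is recovered exactly, while for fractional $r$ the update is differentiable in $r$ almost everywhere and the per-expert ranks can be optimized jointly with the network weights by gradient descent.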