Remote sensing world models aim to both explain observed changes and forecast plausible futures, two tasks that share spatiotemporal priors. Existing methods, however, typically address them separately, limiting cross-task transfer. We present RS-WorldModel, a unified world model for remote sensing that jointly handles spatiotemporal change understanding and text-guided future scene forecasting, and we build RSWBench-1.1M, a 1.1-million-sample dataset with rich language annotations covering both tasks. RS-WorldModel is trained in three stages: (1) Geo-Aware Generative Pre-training (GAGP) conditions forecasting on geographic and acquisition metadata; (2) Synergistic Instruction Tuning (SIT) jointly trains understanding and forecasting; (3) Verifiable Reinforcement Optimization (VRO) refines outputs with verifiable, task-specific rewards. With only 2B parameters, RS-WorldModel surpasses open-source models up to 120$\times$ larger on most spatiotemporal change question-answering metrics. It achieves an FID of 43.13 on text-guided future scene forecasting, outperforming all open-source baselines as well as the closed-source Gemini-2.5-Flash Image (Nano Banana).