S2SServiceBench: A Multimodal Benchmark for Last-Mile S2S Climate Services

Subseasonal-to-seasonal (S2S) forecasts provide a decision-critical planning window of weeks to months for climate resilience and sustainability. A growing bottleneck, however, is the last-mile gap: translating scientific forecasts into trusted, actionable climate services, which requires reliable multimodal understanding and decision-facing reasoning under uncertainty. Meanwhile, multimodal large language models (MLLMs) and the agentic paradigms built on them have made rapid progress in supporting a variety of workflows, but it remains unclear whether they can reliably generate decision-making deliverables (e.g., actionable signal comprehension, decision-making handoff, and decision analysis and planning) from operational service products under uncertainty. To evaluate this capability, we introduce S2SServiceBench, a multimodal benchmark for last-mile S2S climate services curated from an operational climate-service system. S2SServiceBench covers 10 service products with 150+ expert-selected cases in total, spanning six application domains: Agriculture, Disasters, Energy, Finance, Health, and Shipping. Each case is instantiated at three service levels, yielding around 500 tasks and 1,000+ evaluation items across climate resilience and sustainability applications. Using S2SServiceBench, we benchmark state-of-the-art MLLMs and agents and analyze their performance across products and service levels. The results reveal persistent challenges in understanding and reasoning over S2S service plots, namely actionable signal comprehension, operationalizing uncertainty into executable handoffs, and stable, evidence-grounded analysis and planning for dynamic hazards, and they offer actionable guidance for building future climate-service agents.
