As large language models (LLMs) are applied to increasingly long and complex tasks, there is a growing need for realistic long-context benchmarks that require selective reading and integration of heterogeneous, multi-modal information sources. This need is especially acute for geospatial planning problems, such as those found in planning for large-scale military operations, which demand fast and accurate reasoning over maps, orders, intelligence reports, and other distributed data. To address this gap, we present MilSCORE (Military Scenario Contextual Reasoning), to our knowledge the first scenario-level dataset of expert-authored, multi-hop questions grounded in a complex, simulated military planning scenario used for training. MilSCORE is designed to evaluate high-stakes decision-making and planning, probing LLMs' ability to combine tactical and spatial reasoning across multiple sources and to reason over long-horizon, geospatially rich context. The benchmark comprises a diverse set of question types spanning seven categories, targeting both factual recall and multi-step reasoning about constraints, strategy, and spatial analysis. We provide an evaluation protocol and report baseline results for a range of contemporary vision-language models. Our findings reveal substantial headroom on MilSCORE, indicating that current systems struggle with realistic, scenario-level long-context planning and positioning the benchmark as a challenging testbed for future work.