Large Language Models are increasingly being considered for deployment in safety-critical military applications. However, current benchmarks suffer from structural blind spots that systematically overestimate model capabilities in real-world tactical scenarios. Existing frameworks typically ignore strict legal constraints grounded in International Humanitarian Law (IHL), omit edge-computing limitations, lack robustness testing under fog of war, and inadequately evaluate explicit reasoning. To address these gaps, we present WARBENCH, a comprehensive evaluation framework that establishes a foundational tactical baseline alongside four distinct stress-testing dimensions. Through a large-scale empirical evaluation of nine leading models on 136 high-fidelity historical scenarios, we reveal severe structural flaws. First, baseline tactical reasoning systematically collapses under complex terrain and high force asymmetry. Second, while state-of-the-art closed-source models maintain functional compliance, edge-optimized small models expose extreme operational risks, with legal violation rates approaching 70 percent. Third, models experience catastrophic performance degradation under 4-bit quantization and systematic information loss. Conversely, explicit reasoning mechanisms serve as highly effective structural safeguards against inadvertent violations. Ultimately, these findings demonstrate that current models remain fundamentally unready for autonomous deployment in high-stakes tactical environments.