Fine-Tuning Large Language Models for Cooperative Tactical Deconfliction of Small Unmanned Aerial Systems

The growing deployment of small Unmanned Aerial Systems (sUASs) in low-altitude airspace has increased the need for reliable tactical deconfliction under safety-critical constraints. Tactical deconfliction involves short-horizon decision-making in dense, partially observable, and heterogeneous multi-agent environments, where both cooperative separation assurance and operational efficiency must be maintained. While Large Language Models (LLMs) exhibit strong reasoning capabilities, their direct application to air traffic control remains limited by insufficient domain grounding and unpredictable, inconsistent outputs. This paper investigates LLMs as decision-makers in cooperative multi-agent tactical deconfliction, using fine-tuning strategies that align model outputs with human operator heuristics. We propose a simulation-to-language data generation pipeline based on the BlueSky air traffic simulator that produces rule-consistent deconfliction datasets reflecting established safety practices. A pretrained Qwen-Math-7B model is fine-tuned using two parameter-efficient strategies: supervised fine-tuning with Low-Rank Adaptation (LoRA), and preference-based fine-tuning that combines LoRA with Group-Relative Policy Optimization (GRPO). Experimental results on validation datasets and in closed-loop simulations show that supervised LoRA fine-tuning substantially improves decision accuracy, consistency, and separation performance relative to the pretrained LLM, and significantly reduces near mid-air collisions. GRPO provides additional coordination benefits but is less robust when interacting with heterogeneous agent policies.
