CommandSwarm: Safety-Aware Natural Language-to-Behavior-Tree Generation for Robotic Swarms

Natural-language interfaces can make swarm robotics more accessible to non-expert operators, but they must translate ambiguous user intent into executable swarm behaviors without unsupported actions, malformed programs, or unsafe plans. This paper presents CommandSwarm, a safety-aware language-to-behavior-tree pipeline for generating XML behavior trees (BTs) from speech or text commands. The system combines multilingual translation, command-level safety filtering, constrained prompting, a LoRA-adapted large language model (LLM), and deterministic parser validation against a whitelist of executable swarm primitives. We evaluate eleven open LLMs with 6.7B–14B parameters, all using 4-bit quantization, on representative swarm-control scenarios under zero-shot, one-shot, and two-shot prompting. Falcon3-Instruct-10B and Mistral-7B-v3 are the strongest prompt-engineered candidates, reaching BLEU scores above 0.60 and high syntactic validity in few-shot settings. LoRA adaptation of Falcon3-Instruct-10B on a 2,063-example synthetic instruction–BT corpus improves zero-shot BLEU from 0.267 to 0.663, ROUGE-L from 0.366 to 0.692, and parser-accepted syntactic validity from 0% to 72%. Translation experiments further show that SeamlessM4T v2-large and EuroLLM-9B provide the best quality-latency trade-offs for the multilingual front end. The results indicate that compact, quantized, domain-adapted LLMs can generate useful swarm BTs when embedded in a validated systems pipeline. They also show that parser acceptance and safety filtering remain necessary execution gates; generation quality alone is not sufficient for autonomous deployment.
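The deterministic parser gate described above can be illustrated with a minimal sketch. It assumes a BT is accepted only if the XML is well formed and every tag belongs to a fixed whitelist. The tag names (`Sequence`, `Selector`, `FormLine`, `Disperse`, `IsObstacleDetected`, `AvoidObstacle`) and the function `validate_bt` are hypothetical placeholders chosen for illustration; they are not the primitive set or the parser used by CommandSwarm.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist for illustration only; the actual primitive set
# of the system is not given in the abstract.
CONTROL_NODES = {"BehaviorTree", "Sequence", "Selector"}
LEAF_PRIMITIVES = {"FormLine", "Disperse", "IsObstacleDetected", "AvoidObstacle"}


def validate_bt(xml_text):
    """Return (accepted, reasons) for a generated XML behavior tree."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Malformed programs are rejected before any semantic check.
        return False, [f"malformed XML: {exc}"]

    reasons = []
    if root.tag != "BehaviorTree":
        reasons.append(f"unexpected root <{root.tag}>")

    # Walk every node and check it against the whitelist.
    for node in root.iter():
        children = list(node)
        if node.tag in CONTROL_NODES:
            if not children:
                reasons.append(f"control node <{node.tag}> has no children")
        elif node.tag in LEAF_PRIMITIVES:
            if children:
                reasons.append(f"primitive <{node.tag}> must be a leaf")
        else:
            reasons.append(f"unsupported node <{node.tag}>")

    return not reasons, reasons


if __name__ == "__main__":
    candidate = (
        "<BehaviorTree><Sequence>"
        "<IsObstacleDetected/><AvoidObstacle/>"
        "</Sequence></BehaviorTree>"
    )
    print(validate_bt(candidate))
```

Because the check is a plain tree walk with no model in the loop, the same input always yields the same verdict, which is what lets it serve as an execution gate regardless of how well the generated tree scores on BLEU or ROUGE-L.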
