Autonomous unmanned aerial vehicle (UAV) systems are increasingly deployed in safety-critical, networked environments where they must operate reliably in the presence of malicious adversaries. While recent benchmarks have evaluated large language model (LLM)-based UAV agents on reasoning, navigation, and efficiency, systematic assessment of security, resilience, and trust under adversarial conditions remains largely unexplored, particularly in emerging 6G-enabled settings. We introduce $\alpha^{3}$-SecBench, the first large-scale evaluation suite for assessing the security-aware autonomy of LLM-based UAV agents under realistic adversarial interference. Building on multi-turn conversational UAV missions from $\alpha^{3}$-Bench, the framework augments benign episodes with 20,000 validated security-overlay attack scenarios targeting seven autonomy layers: sensing, perception, planning, control, communication, edge/cloud infrastructure, and LLM reasoning. $\alpha^{3}$-SecBench evaluates agents across three orthogonal dimensions: security (attack detection and vulnerability attribution), resilience (safe degradation behavior), and trust (policy-compliant tool usage). We evaluate 23 state-of-the-art LLMs from major industrial providers and leading AI labs on thousands of adversarially augmented UAV episodes sampled from a corpus of 113,475 missions spanning 175 threat types. While many models reliably detect anomalous behavior, effective mitigation, vulnerability attribution, and trustworthy control actions remain inconsistent. Normalized overall scores range from 12.9% to 57.1%, highlighting a significant gap between anomaly detection and security-aware autonomous decision-making. We release $\alpha^{3}$-SecBench on GitHub: https://github.com/maferrag/AlphaSecBench