Large Language Models (LLMs) have demonstrated strong capabilities in natural language reasoning, yet their application to Cyber Threat Intelligence (CTI) remains limited. CTI analysis involves distilling large volumes of unstructured reports into actionable knowledge, a process where LLMs could substantially reduce analyst workload. CTIBench introduced a comprehensive benchmark for evaluating LLMs across multiple CTI tasks. In this work, we extend CTIBench by developing AthenaBench, an enhanced benchmark that includes an improved dataset creation pipeline, duplicate removal, refined evaluation metrics, and a new task focused on risk mitigation strategies. We evaluate twelve LLMs, including state-of-the-art proprietary models such as GPT-5 and Gemini-2.5 Pro, alongside seven open-source models from the LLaMA and Qwen families. While proprietary LLMs achieve stronger results overall, their performance remains subpar on reasoning-intensive tasks, such as threat actor attribution and risk mitigation, with open-source models trailing even further behind. These findings highlight fundamental limitations in the reasoning capabilities of current LLMs and underscore the need for models explicitly tailored to CTI workflows and automation.