CTIConnect: A Benchmark for Retrieval-Augmented LLMs over Heterogeneous Cyber Threat Intelligence

Cyber Threat Intelligence (CTI) is foundational to modern cybersecurity, enabling organizations to proactively defend against evolving threats. However, the sheer volume and heterogeneity of CTI data, spanning structured knowledge bases (CVE, CWE, CAPEC, MITRE ATT&CK) and unstructured threat reports, far exceed the capacity of manual analysis. The strong contextual understanding and reasoning of Large Language Models (LLMs) have driven growing interest in applying them to CTI tasks. Yet no existing benchmark evaluates LLMs in a retrieval-augmented setting with a proper evaluation harness that grants access to the heterogeneous domain knowledge sources analysts rely on in practice. To address this gap, we present CTIConnect, a benchmark for systematically evaluating retrieval-augmented LLMs across the CTI task landscape. We construct a unified evaluation environment integrating five heterogeneous CTI sources into 1,860 expert-verified QA pairs spanning nine tasks across three categories: Entity Linking, Multi-Document Synthesis, and Entity Attribution. Extensive experiments on ten state-of-the-art LLMs reveal that the cross-source semantic gap manifests differently across task categories, demanding fundamentally different retrieval strategies, and that the performance bottleneck shifts between retrieval infrastructure and evidence utilization depending on the task. Our domain-specific strategies further outperform stronger general-purpose retrieval paradigms (retrieve-then-rerank, IRCoT), showing that closing this gap requires structural interventions rather than generic retrieval improvements. These findings hold across all ten LLMs, remain consistent on the full benchmark, and stay stable under temporal splits spanning 2008-2025. Together, they provide actionable guidance for designing scalable retrieval architectures over heterogeneous CTI ecosystems.
