Despite the high volume of open-source Cyber Threat Intelligence (CTI), our understanding of long-term threat actor-victim dynamics remains fragmented due to the lack of structured datasets and inconsistent reporting standards. In this paper, we present a large-scale automated analysis of open-source CTI reports spanning two decades. We develop a high-precision, LLM-based pipeline to ingest and structure 13,308 reports, extracting key entities such as attributed threat actors, motivations, victims, reporting vendors, and technical indicators (IoCs and TTPs). Our analysis quantifies the evolution of CTI information density and specialization, characterizing patterns that relate specific threat actors to motivations and victim profiles. Furthermore, we perform a meta-analysis of the CTI industry itself. We identify a fragmented ecosystem of distinct silos where vendors demonstrate significant geographic and sectoral reporting biases. Our marginal coverage analysis reveals that intelligence overlap between vendors is typically low: while a few core providers may offer broad situational awareness, additional sources yield diminishing returns. Overall, our findings characterize the structural biases inherent in the CTI ecosystem, enabling practitioners and researchers to better evaluate the completeness of their intelligence sources.