Despite the high volume of open-source Cyber Threat Intelligence (CTI), our understanding of long-term threat actor-victim dynamics remains fragmented due to inconsistent reporting standards and the lack of structured datasets containing comprehensive analytic information. In this paper, we present a large-scale automated analysis of open-source CTI reports spanning two decades. We develop a high-precision, LLM-based pipeline to ingest and structure 16,096 reports, extracting key entities such as attributed threat actors, motivations, victims, reporting vendors, and technical indicators (IoCs and TTPs). Our analysis quantifies the evolution of CTI information density and specialization, characterizing patterns that relate specific threat actors to motivations and victim profiles. Furthermore, we perform a meta-analysis of the CTI industry itself. We identify a fragmented ecosystem of distinct silos where vendors demonstrate significant geographic and sectoral reporting biases. Our marginal coverage analysis reveals that intelligence overlap between vendors is typically low: while a few core providers may offer broad situational awareness, additional sources yield diminishing returns. Overall, our findings characterize the structural biases inherent in the CTI ecosystem, enabling practitioners and researchers to better evaluate the completeness of their intelligence sources.
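To make the notion of marginal coverage concrete, the following is a minimal sketch and not the procedure used in this paper, which the abstract does not specify. It assumes each vendor can be reduced to a set of covered items (for example, threat actor names) and orders vendors greedily by how many previously unseen items each one adds. The function name `greedy_marginal_coverage`, the vendor labels, and the toy data are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def greedy_marginal_coverage(
    vendor_items: Dict[str, Set[str]],
) -> List[Tuple[str, int, float]]:
    """Order vendors greedily by the number of not-yet-covered items each adds.

    Returns a list of (vendor, newly_covered_count, cumulative_coverage_fraction).
    Vendors that would add nothing new are omitted from the result.
    """
    # The universe is every item reported by at least one vendor.
    universe: Set[str] = set().union(*vendor_items.values())
    covered: Set[str] = set()
    remaining = dict(vendor_items)
    order: List[Tuple[str, int, float]] = []

    while remaining and universe:
        # Pick the vendor with the largest marginal gain; ties go to the
        # alphabetically first name because max() keeps the first maximum.
        best = max(sorted(remaining), key=lambda v: len(remaining[v] - covered))
        gain = len(remaining[best] - covered)
        if gain == 0:
            break  # every remaining vendor is fully redundant
        covered |= remaining.pop(best)
        order.append((best, gain, len(covered) / len(universe)))
    return order


if __name__ == "__main__":
    # Hypothetical toy data: which threat actors each vendor has reported on.
    toy = {
        "vendor_a": {"actor_1", "actor_2", "actor_3", "actor_4"},
        "vendor_b": {"actor_3", "actor_4", "actor_5"},
        "vendor_c": {"actor_4", "actor_6"},
    }
    for vendor, gain, cumulative in greedy_marginal_coverage(toy):
        print(f"{vendor}: +{gain} new items, cumulative coverage {cumulative:.2f}")
```

In this toy setup the first vendor covers most of the universe and each later vendor adds only one new item, which is the diminishing-returns shape described above. A greedy ordering is only one possible choice; it gives an optimistic view of how quickly coverage can saturate.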