The ever-increasing number of threats and the diversity of information sources pose challenges for Computer Emergency Response Teams (CERTs). To respond to emerging threats, CERTs must gather information in a timely and comprehensive manner, but the sheer volume of sources and reports leads to information overload. This paper addresses the question of how to reduce information overload for CERTs. We propose clustering incoming information, since scanning it is one of the most tiresome, yet necessary, manual steps. Based on current studies, we establish the conditions such a clustering framework must meet and select evaluation metrics that match these conditions. Furthermore, we evaluate and interpret different document embeddings and distance measures in combination with clustering methods. We use three corpora for the evaluation: a novel ground-truth corpus based on threat reports, a security bug report (SBR) corpus, and a corpus of news articles. Our work shows that it is possible to reduce information overload by up to 84.8% with homogeneous clusters. A runtime analysis of the clustering methods supports the choice of the selected methods. The source code and dataset will be made publicly available after acceptance.
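To make the pipeline described above concrete, the following is a minimal sketch and not the authors' implementation. It assumes a bag-of-words embedding, cosine distance, and a simple single-pass "leader" clustering, all chosen here purely for illustration. The names `embed`, `leader_cluster`, `homogeneity`, and the `threshold` value are hypothetical, and the toy reports are invented. The sketch also assumes that "reduction" means an analyst reads one representative per cluster instead of every report; the abstract does not spell out the exact definition.

```python
import math
from collections import Counter


def embed(text):
    """Bag-of-words term counts (an illustrative stand-in for a document embedding)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def leader_cluster(vectors, threshold):
    """Single pass: join the nearest cluster leader within `threshold`, else open a new cluster."""
    leaders, labels = [], []
    for v in vectors:
        dists = [cosine_distance(v, vectors[i]) for i in leaders]
        if dists and min(dists) <= threshold:
            labels.append(dists.index(min(dists)))
        else:
            leaders.append(len(labels))      # index of the current document
            labels.append(len(leaders) - 1)  # id of the newly opened cluster
    return labels


def homogeneity(truth, labels):
    """Entropy-based homogeneity: 1 - H(class | cluster) / H(class)."""
    n = len(truth)
    joint = Counter(zip(truth, labels))
    per_cluster = Counter(labels)
    per_class = Counter(truth)
    h_class = -sum(c / n * math.log(c / n) for c in per_class.values())
    h_cond = -sum(nck / n * math.log(nck / per_cluster[k]) for (_, k), nck in joint.items())
    return 1.0 if h_class == 0 else 1.0 - h_cond / h_class


# Invented toy reports with hand-assigned ground-truth topics.
docs = [
    "ransomware campaign encrypts hospital files",
    "hospital files encrypted by ransomware campaign",
    "phishing emails target bank customers",
    "bank customers hit by phishing emails",
    "ransomware campaign hits hospital network",
    "new phishing emails target bank staff",
]
truth = ["ransomware", "ransomware", "phishing", "phishing", "ransomware", "phishing"]

labels = leader_cluster([embed(d) for d in docs], threshold=0.6)
n_clusters = len(set(labels))

# Assumed definition: share of items an analyst no longer has to read individually.
reduction = 1.0 - n_clusters / len(docs)

print("labels:", labels)
print("homogeneity:", round(homogeneity(truth, labels), 3))
print("reduction:", round(reduction, 3))
```

On this toy input the six reports collapse into two pure clusters, so an analyst would scan two groups instead of six items. The figure says nothing about the corpora evaluated in the paper; it only shows how homogeneity and a reduction ratio can be read off a clustering.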