Many organizations rely on Threat Intelligence (TI) feeds to assess the risk associated with security threats. Because of the volume and heterogeneity of the data, manually analyzing the threat information in different loosely structured TI feeds is impractical. Automated methods are therefore needed to vet TI feeds and extract actionable information from them. To this end, we present a machine learning pipeline that automatically detects vulnerability exploitation from TI feeds. We first model the threat vocabulary of loosely structured TI feeds using state-of-the-art embedding techniques (Doc2Vec and BERT), and then use the resulting embeddings to train a supervised machine learning classifier that detects exploitation of security vulnerabilities. We apply our approach to identify exploitation events in 191 different TI feeds. Our longitudinal evaluation shows that the approach accurately identifies exploitation events when trained only on past data, and that it also does so on TI feeds withheld from training. The approach is useful for a variety of downstream tasks, such as data-driven vulnerability risk assessment.
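The evaluation protocol described above (train only on past data, then test on later events, including events from feeds withheld from training) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Event` record, the feed names, the sample texts, and the labels are hypothetical. The hashed bag-of-words `embed` function is a stand-in for the Doc2Vec and BERT embeddings, and the nearest-centroid classifier is a placeholder for the supervised model, whose exact form the abstract does not specify.

```python
from dataclasses import dataclass
from datetime import date
import zlib


@dataclass(frozen=True)
class Event:
    day: date        # date on which the feed reported the event
    feed: str        # hypothetical feed name
    text: str        # loosely structured event text
    exploited: bool  # hypothetical ground-truth label


def embed(text, dim=64):
    """Stand-in embedding: hashed bag of words (not Doc2Vec or BERT)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode()) % dim] += 1.0
    return vec


def split(events, cutoff, held_out_feeds):
    """Train on past events from non-held-out feeds; test on later events."""
    train = [e for e in events
             if e.day < cutoff and e.feed not in held_out_feeds]
    test_seen = [e for e in events
                 if e.day >= cutoff and e.feed not in held_out_feeds]
    test_unseen = [e for e in events
                   if e.day >= cutoff and e.feed in held_out_feeds]
    return train, test_seen, test_unseen


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def fit(train):
    """Placeholder classifier: one centroid per class."""
    pos = centroid([embed(e.text) for e in train if e.exploited])
    neg = centroid([embed(e.text) for e in train if not e.exploited])
    return pos, neg


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(model, text):
    """Return True if the text is closer to the 'exploited' centroid."""
    pos, neg = model
    v = embed(text)
    return sqdist(v, pos) < sqdist(v, neg)


# Hypothetical toy data: two feeds, events before and after the cutoff.
events = [
    Event(date(2021, 1, 5), "feed_a", "exploit kit delivering payload", True),
    Event(date(2021, 1, 9), "feed_a", "benign port scan", False),
    Event(date(2021, 3, 2), "feed_a", "payload delivered by exploit kit", True),
    Event(date(2021, 3, 4), "feed_b", "port scan observed", False),
    Event(date(2021, 1, 7), "feed_b", "exploit attempt", True),
]

train, test_seen, test_unseen = split(events, date(2021, 2, 1), {"feed_b"})
model = fit(train)
predictions = [(e.feed, predict(model, e.text))
               for e in test_seen + test_unseen]
```

In this sketch the early `feed_b` event is excluded from training entirely, so `test_unseen` measures how the model behaves on a feed it has never seen, while `test_seen` measures performance on future events from a familiar feed.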