In this research, we use user-defined labels from three internet text sources (Reddit, Stackexchange, Arxiv) to train 21 different machine learning models for the topic classification task of detecting cybersecurity discussions in natural text. We analyze the false positive and false negative rates of each of the 21 models in a cross-validation experiment. We then present a Cybersecurity Topic Classification (CTC) tool, which uses the majority vote of the 21 trained machine learning models as its decision mechanism for detecting cybersecurity-related text. We also show that the majority vote mechanism of the CTC tool provides lower false negative and false positive rates on average than any of the 21 individual models. Finally, we show that the CTC tool scales to hundreds of thousands of documents with a wall clock time on the order of hours.
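As a rough illustration of the majority vote decision mechanism described above, the following minimal sketch labels a document as cybersecurity related when more than half of the binary classifiers vote positive. It is not the CTC tool's implementation: the `Classifier` type, the `majority_vote` function, and the keyword-based stand-in models are all hypothetical names introduced here for illustration. With an odd number of models, such as 21, ties cannot occur.

```python
from typing import Callable, Sequence

# Hypothetical type: each trained model is treated as a function that maps
# a document to True (cybersecurity related) or False.
Classifier = Callable[[str], bool]


def majority_vote(models: Sequence[Classifier], document: str) -> bool:
    """Return True when strictly more than half of the models vote True."""
    positive_votes = sum(1 for model in models if model(document))
    return positive_votes > len(models) / 2


def keyword_model(keyword: str) -> Classifier:
    """Toy stand-in for a trained model (an assumption, not a real classifier)."""
    return lambda text: keyword in text.lower()


# Three toy models instead of 21, to keep the example short.
toy_models = [keyword_model(k) for k in ("malware", "exploit", "phishing")]

is_cyber = majority_vote(toy_models, "New malware uses a phishing exploit")
is_not_cyber = majority_vote(toy_models, "Recipe for sourdough bread")
```

In this toy setting, a document that triggers only one of the three stand-in models is rejected, which is how a majority vote can suppress the false positives of an individual model.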