Implementing Active Learning in Cybersecurity: Detecting Anomalies in Redacted Emails

Research on email anomaly detection has typically relied on specially prepared datasets that may not reflect the data encountered in industry settings. In our research at a major financial services company, privacy concerns prevented inspection of email bodies and attachment details, although subject headings and attachment filenames were available. This redaction made it harder to label possible anomalies in the emails. A second difficulty is that the high volume of email, combined with scarce resources, makes machine learning (ML) a necessity while also creating a need for more efficient human training of ML models. Active learning (AL) has been proposed as a way to make that training more efficient. However, implementing AL methods is a human-centered AI challenge because human analysts may be uncertain about their labels, and the labeling task is further complicated in domains such as cybersecurity (or healthcare, aviation, etc.) where labeling mistakes can have highly adverse consequences. In this paper we present results on applying AL to anomaly detection in redacted emails, comparing the utility of different methods for implementing AL in this context. We evaluate different AL strategies and their impact on the resulting model performance. We also examine how experts' confidence ratings for their labels can inform AL. We discuss the results in terms of their implications for AL methodology and for the role of experts in model-assisted email anomaly screening.
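To make the idea of an AL query strategy concrete, the following is a minimal sketch of pool-based uncertainty sampling, one common AL strategy. The abstract does not say which strategies the paper evaluates, so this should be read as an illustration rather than the authors' method. The function names, the 1 to 5 confidence scale, and the mapping from a confidence rating to a training weight are all assumptions made for this example.

```python
from typing import List, Sequence


def uncertainty_scores(probs: Sequence[float]) -> List[float]:
    """Score each unlabeled email by how close its predicted anomaly
    probability is to 0.5 (1.0 = maximally uncertain, 0.0 = certain)."""
    return [1.0 - 2.0 * abs(p - 0.5) for p in probs]


def select_queries(probs: Sequence[float], batch_size: int) -> List[int]:
    """Return the indices of the batch_size most uncertain emails,
    i.e. the ones an analyst would be asked to label next."""
    scores = uncertainty_scores(probs)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]


def confidence_to_weight(rating: int, max_rating: int = 5) -> float:
    """Map an analyst confidence rating (hypothetical 1..max_rating scale)
    to a training weight in (0, 1]. The linear mapping is an assumption."""
    if not 1 <= rating <= max_rating:
        raise ValueError("rating out of range")
    return rating / max_rating


# Hypothetical model outputs for five unlabeled, redacted emails.
pool_probs = [0.02, 0.48, 0.91, 0.55, 0.30]
queried = select_queries(pool_probs, batch_size=2)

# Hypothetical analyst confidence ratings for the queried emails.
analyst_ratings = {1: 5, 3: 2}
weights = {i: confidence_to_weight(analyst_ratings[i]) for i in queried}
```

In a full loop, the weighted labels would be added to the training set, the model refit, and the selection repeated on the remaining pool. Down-weighting low-confidence labels is only one possible way that confidence ratings could inform AL.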


