Many domains now leverage Machine Learning (ML), which promises solutions that autonomously learn to solve complex tasks by training on data. Unfortunately, in cyberthreat detection, high-quality data is hard to come by. Moreover, for some applications of ML, such data must be labeled by human operators. Many works "assume" that labeling is difficult or costly in cyberthreat detection, and propose solutions to address this hurdle. Yet we found no work that specifically addresses the process of labeling from the viewpoint of ML security practitioners. This is a problem: to date, it is still mostly unknown how labeling is done in practice, which prevents one from pinpointing what is needed in the real world. In this paper, we take the first step toward building a bridge between academic research and security practice in the context of data labeling. First, we reach out to five subject matter experts and carry out open interviews to identify pain points in their labeling routines. Then, using our findings as a scaffold, we conduct a user study with 13 practitioners from large security companies, asking detailed questions on subjects such as active learning, costs of labeling, and revision of labels. Finally, we perform proof-of-concept experiments addressing labeling-related aspects of cyberthreat detection that are sometimes overlooked in research. Altogether, our contributions and recommendations serve as a stepping stone for future endeavors aimed at improving the quality and robustness of ML-driven security systems. We release our resources.