Machine learning (ML)-based solutions for network security problems have shown remarkable success, but their use has been impeded by the models' inability to maintain efficacy when deployed in network environments that exhibit different network behaviors. This issue is commonly referred to as the generalizability problem of ML models. The community has recognized the critical role that training datasets play in this context and has developed various techniques to improve dataset curation. Unfortunately, these methods are generally ill-suited or even counterproductive in the network security domain, where they often result in unrealistic or poor-quality datasets. To address this issue, we propose an augmented ML pipeline that leverages explainable ML tools to guide network data collection in an iterative fashion. To ensure the data's realism and quality, we require that new datasets be collected endogenously within this iterative process, thus advocating for a gradual removal of data-related problems to improve model generalizability. To realize this capability, we develop a data-collection platform, netUnicorn, that takes inspiration from the classic "hourglass" model and is implemented as its "thin waist" to simplify data collection for different learning problems from diverse network environments. The proposed system decouples data-collection intents from the deployment mechanisms and disaggregates these high-level intents into smaller, reusable, self-contained tasks. We demonstrate how netUnicorn simplifies collecting data for different learning problems from multiple network environments and how the proposed iterative data collection improves a model's generalizability.
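The decoupling described above can be illustrated with a minimal Python sketch. It is not netUnicorn's actual API: every name in it (`Task`, `Pipeline`, `Connector`, `LocalConnector`, the node labels, and the task bodies) is a hypothetical stand-in, and the tasks return placeholder values instead of touching a real network. It only shows the general idea of expressing a data-collection intent as a sequence of small tasks, with the deployment mechanism kept behind a separate interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol

# All names below are hypothetical illustrations, not netUnicorn's real API.


@dataclass(frozen=True)
class Task:
    """A small, reusable, self-contained unit of data-collection work."""
    name: str
    # The callable receives the results of the tasks that ran before it.
    run: Callable[[Dict[str, object]], object]


@dataclass
class Pipeline:
    """A high-level data-collection intent: an ordered list of tasks."""
    tasks: List[Task] = field(default_factory=list)

    def then(self, task: Task) -> "Pipeline":
        # Return a new pipeline so existing pipelines can be reused as prefixes.
        return Pipeline(self.tasks + [task])

    def execute(self) -> Dict[str, object]:
        results: Dict[str, object] = {}
        for task in self.tasks:
            results[task.name] = task.run(results)
        return results


class Connector(Protocol):
    """Deployment mechanism: runs a pipeline on a set of nodes."""

    def deploy(
        self, pipeline: Pipeline, nodes: List[str]
    ) -> Dict[str, Dict[str, object]]: ...


class LocalConnector:
    """Stand-in backend that simply runs the pipeline in-process per node."""

    def deploy(
        self, pipeline: Pipeline, nodes: List[str]
    ) -> Dict[str, Dict[str, object]]:
        return {node: pipeline.execute() for node in nodes}


# Placeholder tasks: they simulate results rather than capture real traffic.
start_capture = Task("start_capture", lambda prev: "capture-started")
generate_traffic = Task("generate_traffic", lambda prev: {"flows": 3})
stop_capture = Task(
    "stop_capture",
    lambda prev: f"saved {prev['generate_traffic']['flows']} flows",
)

# The intent is written once, independent of where it will run.
intent = Pipeline().then(start_capture).then(generate_traffic).then(stop_capture)

# The same intent could be handed to any object satisfying Connector.
results = LocalConnector().deploy(intent, nodes=["campus-node", "cloud-node"])
```

In this sketch, moving to a different environment would mean supplying a different `Connector` implementation while leaving `intent` unchanged, which is the separation the abstract attributes to the "thin waist" design.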