In Search of netUnicorn: A Data-Collection Platform to Develop Generalizable ML Models for Network Security Problems

The remarkable success of the use of machine learning-based solutions for network security problems has been impeded by the developed ML models' inability to maintain efficacy when used in different network environments exhibiting different network behaviors. This issue is commonly referred to as the generalizability problem of ML models. The community has recognized the critical role that training datasets play in this context and has developed various techniques to improve dataset curation to overcome this problem. Unfortunately, these methods are generally ill-suited or even counterproductive in the network security domain, where they often result in unrealistic or poor-quality datasets. To address this issue, we propose an augmented ML pipeline that leverages explainable ML tools to guide the network data collection in an iterative fashion. To ensure the data's realism and quality, we require that the new datasets should be endogenously collected in this iterative process, thus advocating for a gradual removal of data-related problems to improve model generalizability. To realize this capability, we develop a data-collection platform, netUnicorn, that takes inspiration from the classic "hourglass" model and is implemented as its "thin waist" to simplify data collection for different learning problems from diverse network environments. The proposed system decouples data-collection intents from the deployment mechanisms and disaggregates these high-level intents into smaller reusable, self-contained tasks. We demonstrate how netUnicorn simplifies collecting data for different learning problems from multiple network environments and how the proposed iterative data collection improves a model's generalizability.
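The decoupling described in the abstract can be illustrated with a minimal sketch. It assumes a data-collection intent is an ordered list of small reusable tasks, and that each network environment supplies its own deployment mechanism. Every name here (Task, Pipeline, Environment, make_pipeline) is hypothetical and chosen for illustration; none of it is netUnicorn's actual API, and the tasks only return strings rather than capturing real traffic.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# All names below are hypothetical; this is not netUnicorn's actual API.


@dataclass
class Task:
    """A small, self-contained, reusable unit of data-collection work."""
    name: str
    run: Callable[[Dict[str, str]], str]  # receives node info, returns a result


@dataclass
class Pipeline:
    """A data-collection intent: ordered tasks, with no deployment details."""
    tasks: List[Task] = field(default_factory=list)

    def then(self, task: Task) -> "Pipeline":
        self.tasks.append(task)
        return self


class Environment:
    """Deployment mechanism for one network environment (assumed interface)."""

    def __init__(self, name: str, nodes: List[Dict[str, str]]):
        self.name = name
        self.nodes = nodes

    def execute(self, pipeline: Pipeline) -> Dict[str, List[str]]:
        # Run every task of the pipeline, in order, on every node.
        return {
            node["id"]: [task.run(node) for task in pipeline.tasks]
            for node in self.nodes
        }


def make_pipeline() -> Pipeline:
    # Placeholder tasks that only report what they would do.
    start = Task("start_capture", lambda n: f"capture started on {n['id']}")
    traffic = Task("generate_traffic", lambda n: f"traffic generated on {n['id']}")
    stop = Task("stop_capture", lambda n: f"capture stopped on {n['id']}")
    return Pipeline().then(start).then(traffic).then(stop)


# The same intent is reused unchanged across two different environments.
pipeline = make_pipeline()
campus = Environment("campus", [{"id": "campus-1"}, {"id": "campus-2"}])
cloud = Environment("cloud", [{"id": "cloud-1"}])
results = {env.name: env.execute(pipeline) for env in (campus, cloud)}
```

In this sketch the pipeline never refers to a specific environment, so adding a new environment only requires a new Environment instance, not a change to the tasks.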

