Learning from Limited Heterogeneous Training Data: Meta-Learning for Unsupervised Zero-Day Web Attack Detection across Web Domains

Unsupervised machine learning systems have recently been developed to detect zero-day Web attacks, and they can effectively enhance existing Web Application Firewalls (WAFs). However, prior work only considers detecting attacks on specific domains by training a dedicated detection model for each domain. These systems require a large amount of training data, which lengthens model training and deployment. In this paper, we propose RETSINA, a novel meta-learning based framework that enables zero-day Web attack detection across different domains in an organization with limited training data. Specifically, it uses meta-learning to share knowledge across these domains, e.g., the relationship between HTTP requests in heterogeneous domains, to train detection models efficiently. Moreover, we develop an adaptive preprocessing module to facilitate semantic analysis of Web requests across different domains, and we design a multi-domain representation method that captures semantic correlations between domains for cross-domain model training. We conduct experiments using four real-world datasets from different domains with a total of 293M Web requests. The experimental results demonstrate that RETSINA outperforms existing unsupervised Web attack detection methods when training data is limited; for example, RETSINA needs only 5 minutes of training data to achieve detection performance comparable to that of existing methods that train separate models for different domains using 1 day of training data. We also deploy RETSINA in a real-world setting at an Internet company, where over one month it captures on average 126 and 218 zero-day attack requests per day in two domains, respectively.
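To make the idea of sharing knowledge across domains concrete, the following is a minimal sketch of first-order meta-learning in the style of Reptile, applied to a toy unsupervised anomaly detector. It is not RETSINA's algorithm, which the abstract does not specify. Everything here is an assumption made for illustration: the two-dimensional feature vectors standing in for Web requests, the centroid model, and the names `inner_adapt`, `reptile`, `anomaly_score`, and `make_domain`.

```python
import random


def inner_adapt(theta, samples, lr=0.1, steps=5):
    """Take a few gradient steps on loss = mean ||x - w||^2, starting from theta."""
    w = list(theta)
    n = len(samples)
    for _ in range(steps):
        # Gradient of the mean squared distance w.r.t. w is 2 * (w - mean(x)).
        grad = [2.0 * sum(w[j] - x[j] for x in samples) / n for j in range(len(w))]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def reptile(domains, dim, meta_lr=0.5, rounds=200, shots=8, seed=0):
    """Reptile-style meta-training: move the shared initialization toward
    the parameters adapted on a small batch from a randomly chosen domain."""
    rng = random.Random(seed)
    theta = [0.0] * dim
    for _ in range(rounds):
        data = rng.choice(domains)
        batch = rng.sample(data, shots)
        adapted = inner_adapt(theta, batch)
        theta = [t + meta_lr * (a - t) for t, a in zip(theta, adapted)]
    return theta


def anomaly_score(w, x):
    """Squared distance from the domain centroid; larger means more anomalous."""
    return sum((xj - wj) ** 2 for xj, wj in zip(x, w))


def make_domain(center, n, rng, noise=0.1):
    """Hypothetical benign traffic: noisy points around a domain-specific center."""
    return [[c + rng.gauss(0, noise) for c in center] for _ in range(n)]


rng = random.Random(42)
# Three related training domains (toy stand-ins for Web domains in one organization).
train_domains = [
    make_domain([1.0, 1.0], 100, rng),
    make_domain([1.2, 0.8], 100, rng),
    make_domain([0.8, 1.2], 100, rng),
]
theta = reptile(train_domains, dim=2)

# A new domain with only five benign samples: adapt from the shared initialization.
new_domain = make_domain([1.1, 0.9], 5, rng)
w_new = inner_adapt(theta, new_domain)

benign = [1.1, 0.9]
attack = [5.0, -3.0]
print(anomaly_score(w_new, benign) < anomaly_score(w_new, attack))
```

The design point the sketch tries to convey is that the shared initialization `theta` already sits close to what benign traffic looks like in related domains, so a handful of samples and a few gradient steps suffice to specialize it to a new domain. A real system would replace the centroid model with a learned representation of HTTP requests.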
