Language Models for Novelty Detection in System Call Traces

Due to the complexity of modern computer systems, novel and unexpected behaviors frequently occur. Such deviations are either normal occurrences, such as software updates and new user activities, or abnormalities, such as misconfigurations, latency issues, intrusions, and software bugs. In either case, novel behaviors are of great interest to developers, and there is a genuine need for efficient and effective methods to detect them. Researchers currently consider system calls to be the most fine-grained and accurate source of information for investigating the behavior of computer systems. Accordingly, this paper introduces a novelty detection methodology that relies on a probability distribution over sequences of system calls, which can be seen as a language model. Language models estimate the likelihood of sequences, and since novelties by definition deviate from previously observed behaviors, they are unlikely under the model. Following the success of neural networks in language modeling, three architectures are evaluated in this work: the widespread LSTM, the state-of-the-art Transformer, and the lower-complexity Longformer. However, large neural networks typically require an enormous amount of data to be trained effectively, and to the best of our knowledge, no massive modern datasets of kernel traces are publicly available. This paper addresses this limitation by introducing a new open-source dataset of kernel traces comprising over 2 million web requests with seven distinct behaviors. The proposed methodology requires minimal expert hand-crafting and achieves an F-score and AuROC greater than 95% on most novelties while being data- and task-agnostic. The source code and trained models are publicly available on GitHub, and the datasets are available on Zenodo.
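The core idea, scoring a sequence of system calls by its likelihood under a language model and flagging unlikely sequences as novel, can be illustrated with a minimal sketch. The sketch below is not the paper's implementation: it swaps the neural architectures (LSTM, Transformer, Longformer) for an add-one smoothed bigram model so that it stays self-contained, and the system call traces, the `BigramModel` class, the `novelty_score` and `is_novel` helpers, and the threshold rule are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


class BigramModel:
    """Add-one smoothed bigram model over system call names.

    A deliberately simple stand-in for a neural language model.
    """

    def __init__(self, sequences):
        # Tokens the model can predict; "<start>" is only ever a context.
        self.vocab = {"<unk>", "<end>"}
        self.counts = defaultdict(Counter)
        for seq in sequences:
            self.vocab.update(seq)
            tokens = ["<start>"] + list(seq) + ["<end>"]
            for prev, cur in zip(tokens, tokens[1:]):
                self.counts[prev][cur] += 1

    def _known(self, token):
        # Map system calls never seen in training to "<unk>".
        if token == "<start>" or token in self.vocab:
            return token
        return "<unk>"

    def log_prob(self, prev, cur):
        prev, cur = self._known(prev), self._known(cur)
        row = self.counts.get(prev, Counter())
        total = sum(row.values())
        return math.log((row[cur] + 1) / (total + len(self.vocab)))


def novelty_score(model, seq):
    """Mean negative log-likelihood per token; higher means less likely."""
    tokens = ["<start>"] + list(seq) + ["<end>"]
    nll = -sum(model.log_prob(p, c) for p, c in zip(tokens, tokens[1:]))
    return nll / (len(tokens) - 1)


# Hypothetical traces of known behaviors (illustrative only, not from the dataset).
known_a = ["accept", "read", "open", "read", "close", "write", "close"]
known_b = ["accept", "read", "write", "close"]
train = [known_a] * 50 + [known_b] * 50

model = BigramModel(train)

# Assumed threshold rule: the highest score observed on known behaviors.
threshold = max(novelty_score(model, s) for s in train)


def is_novel(seq):
    return novelty_score(model, seq) > threshold


# A hypothetical trace containing system calls never observed in training.
unseen = ["accept", "read", "execve", "socket", "connect", "write", "close"]

print(f"known:  {novelty_score(model, known_a):.3f}")
print(f"unseen: {novelty_score(model, unseen):.3f}")
print("flagged as novel:", is_novel(unseen))
```

Normalizing by sequence length keeps long requests from being flagged merely for being long. In practice the threshold would more plausibly be tuned on held-out data than taken from the training maximum, which is used here only to keep the sketch short.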
