Cyber-systems are under near-constant threat from intrusion attempts. Attack types vary, but each attempt typically has a specific underlying intent, and the perpetrators are typically groups of individuals with similar objectives. Clustering attacks that appear to share a common intent is therefore valuable to threat-hunting experts. This article explores Dirichlet distribution topic models for clustering terminal session commands collected from honeypots, which are special network hosts designed to entice malicious attackers. Clustering the sessions has two main practical uses: finding groups of similar attacks, and identifying outliers. A range of statistical models is considered, adapted to the structure of command-line syntax. In particular, the concepts of primary and secondary topics, and then of session-level and command-level topics, are introduced into the models to improve interpretability. The proposed methods are further extended in a Bayesian nonparametric fashion to allow an unbounded vocabulary size and number of latent intents. The methods are shown to discover an unusual MIRAI variant that attempts to take over existing cryptocurrency coin-mining infrastructure and that is not detected by traditional topic-modelling approaches.
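To make the idea of a session-level topic concrete, the following is a minimal sketch of a Dirichlet-multinomial mixture fitted by collapsed Gibbs sampling, in which every session receives exactly one latent topic. It is not the model proposed in the article, which additionally uses primary and secondary topics, command-level topics, and nonparametric extensions. The function names, the hyperparameters `gamma` and `eta`, the whitespace tokenizer, and the toy sessions are all assumptions made for illustration only.

```python
import math
import random
from collections import Counter


def tokenize(session):
    """Split a raw session string into tokens (a deliberately crude assumption)."""
    return session.replace(";", " ").split()


def gibbs_session_topics(sessions, n_topics=2, gamma=1.0, eta=0.1,
                         n_iter=50, seed=0):
    """Assign one topic per tokenized session by collapsed Gibbs sampling.

    gamma: symmetric Dirichlet prior on the topic proportions (assumed value).
    eta:   symmetric Dirichlet prior on each topic's word distribution.
    """
    rng = random.Random(seed)
    vocab_size = len({w for s in sessions for w in s})
    counts = [Counter(s) for s in sessions]

    # Random initial assignment, plus the sufficient statistics it implies.
    z = [rng.randrange(n_topics) for _ in sessions]
    topic_sessions = [0] * n_topics              # sessions per topic
    topic_word = [Counter() for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                 # total tokens per topic
    for d, k in enumerate(z):
        topic_sessions[k] += 1
        topic_word[k].update(counts[d])
        topic_total[k] += len(sessions[d])

    for _ in range(n_iter):
        for d, c in enumerate(counts):
            n_d = len(sessions[d])
            # Remove session d from its current topic.
            k_old = z[d]
            topic_sessions[k_old] -= 1
            topic_word[k_old].subtract(c)
            topic_total[k_old] -= n_d

            # Log of the collapsed conditional probability for each topic.
            log_p = []
            for k in range(n_topics):
                lp = math.log(topic_sessions[k] + gamma)
                for w, cnt in c.items():
                    for j in range(cnt):
                        lp += math.log(topic_word[k][w] + eta + j)
                for i in range(n_d):
                    lp -= math.log(topic_total[k] + vocab_size * eta + i)
                log_p.append(lp)

            # Sample a new topic from the normalized probabilities.
            mx = max(log_p)
            weights = [math.exp(lp - mx) for lp in log_p]
            k_new = rng.choices(range(n_topics), weights=weights)[0]

            # Add session d back under its new topic.
            z[d] = k_new
            topic_sessions[k_new] += 1
            topic_word[k_new].update(c)
            topic_total[k_new] += n_d
    return z


# Invented toy sessions, not real honeypot data.
toy_sessions = [
    "cd /tmp; wget bot.sh; chmod 777 bot.sh; sh bot.sh",
    "cd /var/run; wget bot.sh; chmod 777 bot.sh; sh bot.sh",
    "uname -a; cat /proc/cpuinfo; free -m",
    "uname -a; cat /proc/cpuinfo; ps aux",
]
tokens = [tokenize(s) for s in toy_sessions]
labels = gibbs_session_topics(tokens, n_topics=2)
print(labels)
```

The returned labels are a single sample from the sampler's final state, so they depend on the seed and on the number of iterations. In practice one would run several chains and summarize the assignments rather than rely on one draw.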