Cyber-systems are under near-constant threat from intrusion attempts. Attack types vary, but each attempt typically has a specific underlying intent, and the perpetrators are typically groups of individuals with similar objectives. Clustering attacks that appear to share a common intent is valuable to threat-hunting experts. This article explores Dirichlet distribution topic models for clustering terminal session commands collected from honeypots, which are special network hosts designed to entice malicious attackers. The main practical implications of clustering the sessions are twofold: finding groups of similar attacks, and identifying outliers. A range of statistical models is considered, each adapted to the structure of command-line syntax. In particular, the concepts of primary and secondary topics, and then of session-level and command-level topics, are introduced into the models to improve interpretability. The proposed methods are further extended in a Bayesian nonparametric fashion to allow the vocabulary size and the number of latent intents to be unbounded. The methods are shown to discover an unusual MIRAI variant that attempts to take over existing cryptocurrency coin-mining infrastructure and that is not detected by traditional topic-modelling approaches.