Collected security-related logs hold the key to understanding attack behaviors and diagnosing vulnerabilities, yet analyzing them remains a daunting challenge. Recently, Language Models (LMs) have demonstrated unmatched potential in understanding natural and programming languages. The question is whether and how LMs could also be useful to security experts, given that security logs contain inherently confusing and obfuscated information. In this paper, we systematically study how to use state-of-the-art LMs to automatically analyze text-like Unix shell attack logs. We present a thorough design methodology that leads to LogPrécis. It takes raw shell sessions as input and automatically identifies and assigns the attacker's tactic to each portion of the session, thereby unveiling the sequence of the attacker's goals. We demonstrate LogPrécis's ability to support the analysis of two large datasets containing about 400,000 unique Unix shell attacks. LogPrécis reduces them to about 3,000 fingerprints, each grouping sessions with the same sequence of tactics. The abstraction it provides lets the analyst better understand attacks, identify fingerprints, detect novelty, link similar attacks, and track families and mutations. Overall, LogPrécis, released as open source, paves the way for better and more responsive defense against cyberattacks.
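The fingerprint idea, grouping sessions that share the same sequence of tactics, can be illustrated with a minimal Python sketch. It is not the LogPrécis implementation: the function names, the session identifiers, the tactic labels, and the choice to collapse consecutive repeats of a tactic are all assumptions made for illustration, and the per-portion tactic labels are taken as already predicted by some model.

```python
from collections import defaultdict
from itertools import groupby


def fingerprint(tactics):
    """Return the ordered tactic sequence with consecutive repeats collapsed.

    Collapsing repeats is an assumption of this sketch, not a detail
    taken from the abstract.
    """
    return tuple(tactic for tactic, _ in groupby(tactics))


def group_by_fingerprint(sessions):
    """Map each fingerprint to the list of session ids that share it."""
    groups = defaultdict(list)
    for session_id, tactics in sessions.items():
        groups[fingerprint(tactics)].append(session_id)
    return dict(groups)


# Hypothetical per-portion tactic labels for three shell sessions.
sessions = {
    "s1": ["Discovery", "Discovery", "Execution"],
    "s2": ["Discovery", "Execution", "Execution"],
    "s3": ["Persistence", "Execution"],
}

groups = group_by_fingerprint(sessions)
for fp, ids in groups.items():
    print(" -> ".join(fp), ids)
```

In this toy input, sessions `s1` and `s2` differ in their raw labels but share one fingerprint, which is the kind of reduction the abstract describes at scale.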