The collection of security-related logs holds the key to understanding attack behaviors and diagnosing vulnerabilities. Still, their analysis remains a daunting challenge. Recently, Language Models (LMs) have demonstrated unmatched potential in understanding natural and programming languages. The question arises whether and how LMs could also be useful to security experts, given that security logs contain intrinsically confusing and obfuscated information. In this paper, we systematically study how to leverage state-of-the-art LMs to automatically analyze text-like Unix shell attack logs. We present a thorough design methodology that leads to LogPrécis. It takes raw shell sessions as input and automatically identifies and assigns the attacker tactic to each portion of the session, thereby unveiling the sequence of the attacker's goals. We demonstrate LogPrécis's capability to support the analysis of two large datasets containing about 400,000 unique Unix shell attacks. LogPrécis reduces them to about 3,000 fingerprints, each grouping sessions with the same sequence of tactics. The abstraction it provides lets the analyst better understand attacks, identify fingerprints, detect novelty, link similar attacks, and track families and mutations. Overall, LogPrécis, released as open source, paves the way for better and more responsive defense against cyberattacks.
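To make the fingerprint notion concrete, the following is a minimal sketch of grouping sessions by their sequence of tactics. It is not the LogPrécis implementation: it assumes each session has already been labelled with one tactic per portion, and the function name `group_by_fingerprint`, the session identifiers, and the tactic labels are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def group_by_fingerprint(labelled_sessions):
    """Group session IDs by their tactic sequence.

    `labelled_sessions` maps a session ID to the ordered list of tactic
    labels assigned to its portions (assumed to be produced upstream).
    The fingerprint here is simply that sequence, stored as a tuple.
    """
    groups = defaultdict(list)
    for session_id, tactics in labelled_sessions.items():
        groups[tuple(tactics)].append(session_id)
    return dict(groups)


# Illustrative input only: IDs and labels are made up, not taken from the datasets.
sessions = {
    "s1": ["Discovery", "Persistence", "Execution"],
    "s2": ["Discovery", "Persistence", "Execution"],
    "s3": ["Discovery", "Execution"],
}

fingerprints = group_by_fingerprint(sessions)
for fingerprint, members in fingerprints.items():
    print(" > ".join(fingerprint), members)
```

In this sketch, sessions whose raw commands differ but whose tactic sequences match fall into the same group, which is the kind of reduction from many sessions to few fingerprints that the abstract describes.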