Threat hunting analyzes large, noisy, high-dimensional data to find sparse adversarial behavior. We believe that adversarial activity, however disguised, is extremely difficult to obscure completely in high-dimensional space. In this paper, we use the latent features of cyber data to find anomalies with a prototype tool called the Cyber Log Embeddings Model (CLEM). CLEM was trained on Zeek network traffic logs from both a real-world production network and an Internet of Things (IoT) cybersecurity testbed. The model is deliberately overtrained on a sliding window of data so that it characterizes each window closely. We use the Adjusted Rand Index (ARI) to compare the k-means clustering of CLEM output to expert labeling of the embeddings. Our approach demonstrates the promise of using natural language modeling to understand cyber data.
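To make the evaluation step concrete, the following is a minimal sketch of how an Adjusted Rand Index could be computed between cluster assignments and expert labels. It is not CLEM's implementation: the function name `adjusted_rand_index`, the example cluster IDs, and the label strings are all hypothetical, and the cluster IDs merely stand in for the output of a k-means run over the embeddings. The sketch uses the standard pair-counting form of ARI, which scores 1.0 for identical partitions and about 0 for chance-level agreement.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting ARI between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same items")

    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one item: the partitions trivially agree

    # Pairs of items that are grouped together in both labelings.
    contingency = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in contingency.values())

    # Pairs of items that are grouped together within each labeling.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / total_pairs  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2         # agreement if partitions matched
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical data: cluster IDs standing in for k-means output, and
# hypothetical expert labels for the same six embeddings.
cluster_ids = [0, 0, 1, 1, 2, 2]
expert_labels = ["scan", "scan", "benign", "benign", "iot", "iot"]
print(adjusted_rand_index(cluster_ids, expert_labels))
```

Because ARI depends only on which items are grouped together, the cluster IDs do not need to match the expert label names. This is why it suits a comparison between an unsupervised clustering and a human labeling.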