System-generated logs underpin security monitoring, yet their rigid template-based format hinders both automated analysis and human comprehension. We present NLLog (Natural-Language Log), a lightweight pipeline that deterministically rewrites parsed templates into WHO-WHAT-SEVERITY sentences, pools them with term frequency-inverse document frequency (TF-IDF) weighting, classifies sessions with tree ensembles, and back-projects evidence with TreeSHAP for analyst review. On the Hadoop Distributed File System (HDFS) and Blue Gene/L (BGL) corpora, NLLog outperforms two baselines reproduced under a matched protocol; across HDFS, BGL, and the AIT Alert Data Set, it sustains low false-positive rates at a latency on commodity hardware that is suitable for security operations center triage. Coverage, sparse-versus-dense, faithfulness, and adversarial ablations show that fallback sufficiency is corpus-dependent, that an enrollment-time coverage check can surface refinement requirements before deployment, and that an auditable deterministic rewrite combined with lightweight dense encoding provides a measurable representation layer for log-anomaly detection and triage.
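The first two stages of the pipeline described in the abstract (deterministic rewrite, then TF-IDF-weighted pooling) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `TEMPLATES` entries, the `SEVERITY` mapping, the sentence pattern in `rewrite`, the smoothed IDF formula, and the hashed bag-of-words `encode` function, which merely stands in for whatever lightweight dense encoder NLLog uses. The tree-ensemble classifier and the TreeSHAP back-projection are omitted.

```python
import math
import zlib
from collections import Counter

# Hypothetical parsed templates: event id -> (component, template, log level).
TEMPLATES = {
    "E1": ("DataNode", "Receiving block <*> src: <*> dest: <*>", "INFO"),
    "E2": ("DataNode", "Exception in receiveBlock for block <*>", "WARN"),
    "E3": ("NameSystem", "allocateBlock: <*>", "INFO"),
}

# Assumed mapping from log level to a severity phrase.
SEVERITY = {"INFO": "routine", "WARN": "warning", "ERROR": "error"}


def rewrite(component, template, level):
    """Deterministically render a template as a WHO-WHAT-SEVERITY sentence."""
    tokens = template.replace(":", " ").split()
    what = " ".join(tok for tok in tokens if tok != "<*>")  # drop wildcards
    severity = SEVERITY.get(level, "unknown")
    return f"{component} reports: {what} (severity: {severity})."


def tfidf_weights(session, sessions):
    """TF-IDF weight of each event id in one session (smoothed IDF)."""
    n = len(sessions)
    weights = {}
    for event, count in Counter(session).items():
        df = sum(1 for s in sessions if event in s)
        idf = math.log((1 + n) / (1 + df)) + 1.0
        weights[event] = (count / len(session)) * idf
    return weights


def encode(sentence, dim=8):
    """Stand-in encoder: hashed bag of words (NOT the paper's encoder)."""
    vec = [0.0] * dim
    for token in sentence.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def pool(session, sessions, dim=8):
    """TF-IDF-weighted average of the sentence vectors in a session."""
    weights = tfidf_weights(session, sessions)
    total = sum(weights.values())
    pooled = [0.0] * dim
    if total == 0:
        return pooled  # empty session: nothing to pool
    for event, weight in weights.items():
        vec = encode(rewrite(*TEMPLATES[event]), dim)
        for i, value in enumerate(vec):
            pooled[i] += (weight / total) * value
    return pooled


# Toy sessions, each a sequence of event ids.
sessions = [["E1", "E3", "E1"], ["E1", "E2", "E3"], ["E3", "E1"]]
features = [pool(s, sessions) for s in sessions]
print(rewrite(*TEMPLATES["E2"]))
```

Because the rewrite is a pure function of the parsed template, the same template always yields the same sentence, which is what makes the representation auditable. In this toy corpus the rare event `E2` receives a larger weight than the ubiquitous `E1`, so it dominates the pooled vector of the session that contains it; the resulting `features` rows are what a tree ensemble would consume.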