This project explores large language models (LLMs) for anomaly detection across heterogeneous log sources. Traditional intrusion detection systems suffer from high false-positive rates, semantic blindness, and data scarcity: because logs are inherently sensitive, clean datasets are rare. We address these challenges through three contributions: (1) LogAtlas-Foundation-Sessions and LogAtlas-Defense-Set, balanced, heterogeneous, privacy-preserving log datasets with explicit attack annotations; (2) an empirical benchmark showing why standard metrics such as F1 and accuracy are misleading for security applications; and (3) a two-phase training framework that combines log understanding (Base-AMAN, 3B parameters) with real-time detection (AMAN, 0.5B parameters, obtained via knowledge distillation). Results demonstrate practical feasibility, with inference times of 0.3 to 0.5 seconds per session and operational costs below 50 USD per day.
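One way to see why accuracy and F1 can mislead in security settings is the base-rate effect: a detector that looks strong on a balanced benchmark can still bury analysts in false alarms once attacks are rare. The sketch below is a minimal illustration of that arithmetic only. The function name `alert_metrics`, the session count, the attack rates, and the 95% / 5% detection and false-positive rates are all hypothetical values chosen for the example; none of them are results from this project.

```python
def alert_metrics(n_sessions, attack_rate, tpr, fpr):
    """Expected accuracy, precision, and F1 for a detector with fixed
    true-positive rate (tpr) and false-positive rate (fpr).

    All inputs are illustrative assumptions, not measured values.
    """
    attacks = n_sessions * attack_rate
    benign = n_sessions - attacks

    tp = attacks * tpr   # attacks correctly flagged
    fp = benign * fpr    # benign sessions wrongly flagged
    tn = benign - fp     # benign sessions correctly passed

    accuracy = (tp + tn) / n_sessions
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tpr
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return {"accuracy": accuracy, "precision": precision, "f1": f1}


# Same hypothetical detector, two different attack prevalences.
balanced = alert_metrics(100_000, attack_rate=0.5, tpr=0.95, fpr=0.05)
deployed = alert_metrics(100_000, attack_rate=0.001, tpr=0.95, fpr=0.05)

for name, m in (("balanced", balanced), ("deployed", deployed)):
    print(f"{name:9s} accuracy={m['accuracy']:.3f} "
          f"precision={m['precision']:.3f} f1={m['f1']:.3f}")
```

Under these assumed numbers, accuracy is the same in both settings, while precision drops sharply when attacks are rare, so most alerts would be false alarms. A score measured at one attack prevalence therefore says little about alert quality at another.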