Deep Learning for Contextualized NetFlow-Based Network Intrusion Detection: Methods, Data, Evaluation and Deployment

Network Intrusion Detection Systems (NIDS) have progressively shifted from signature-based techniques toward machine learning and, more recently, deep learning methods. Meanwhile, the widespread adoption of encryption has reduced payload visibility, weakening inspection pipelines that depend on plaintext content and increasing reliance on flow-level telemetry such as NetFlow and IPFIX. Many current learning-based detectors still frame intrusion detection as per-flow classification, implicitly treating each flow record as an independent sample. This assumption is often violated in realistic attack campaigns, where evidence is distributed across multiple flows and hosts and can span minutes to days through staged execution, beaconing, lateral movement, and exfiltration. This paper synthesizes recent research on context-aware deep learning for flow-based intrusion detection. We organize existing methods into a four-dimensional taxonomy covering temporal context, graph or relational context, multimodal context, and multi-resolution context. Beyond modeling, we emphasize rigorous evaluation and operational realism. We review common failure modes that can inflate reported results, including temporal leakage, improper data splitting, dataset design flaws, limited dataset diversity, and weak cross-dataset generalization. We also analyze practical constraints that shape deployability, such as streaming state management, memory growth, latency budgets, and model compression choices. Overall, the literature suggests that context can meaningfully improve detection when attacks induce measurable temporal or relational structure, but the magnitude and reliability of these gains depend strongly on rigorous, causal evaluation and on datasets that capture realistic diversity.
