Do Transformers Actually Help Intrusion Detection? A Temporal Sequence Evaluation on CIC-IDS2017

Recent deep learning approaches for network intrusion detection increasingly incorporate temporal architectures such as recurrent networks and Transformers, often reporting near-perfect performance on CIC-IDS2017. However, many existing studies neither supply their temporal modules with genuine sequence inputs nor evaluate under realistic, leakage-free conditions, making it unclear whether reported gains arise from true sequence-modeling capability. In this work, we reformulate CIC-IDS2017 as a temporal intrusion-detection task by constructing ordered flow sequences from network conversations and benchmarking nine classical and deep learning architectures under a random split, two leakage-free splits, and a padding-scheme ablation. The central finding is that padding convention, not architecture, determines the Transformer's performance: on genuinely sequential (non-padded) windows the Transformer achieves the highest macro-F1 of any model in the experiment (0.89); under zero-pad+mask evaluation it drops markedly (-0.24 macro-F1), while LSTM, GRU, and 1D-CNN remain stable. Under leakage-free group evaluation the Random Forest is the most robust model (+0.009), while the Transformer's false-alarm rate grows from 0.04% to 2.7%, a 67-fold increase invisible under conventional protocols. These findings demonstrate that evaluation methodology -- specifically padding convention and split protocol -- has a larger effect on reported performance than architectural choice, and that widely used random splits with repeat-last padding can overestimate model robustness by up to 0.24 macro-F1. We advocate leakage-free splits, explicit padding disclosure, and sequence-aware benchmarking as standard practice in future IDS research. Code and implementation details are available at https://github.com/zachmocz/temporal-ids-bench.
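To make the two methodological variables named above concrete, the following is a minimal sketch of (a) repeat-last versus zero-pad+mask padding for short flow windows and (b) a group-based split that keeps every conversation entirely in train or in test. It is an illustration under stated assumptions, not the code from the linked repository. The function names `pad_window` and `group_split`, the choice to append padding at the end of the window, the truncation to the most recent flows, and the use of a conversation ID as the grouping key are all hypothetical.

```python
import numpy as np


def pad_window(flows, length, scheme):
    """Pad a (t, d) array of per-flow features to (length, d).

    Returns (padded, mask), where mask[i] is True for steps the model
    should treat as real input. Padding is appended at the end and long
    windows keep their most recent flows (both are assumptions).
    """
    flows = np.asarray(flows, dtype=float)
    t, d = flows.shape
    if t == 0:
        raise ValueError("window must contain at least one flow")
    if t >= length:
        return flows[-length:], np.ones(length, dtype=bool)

    n_pad = length - t
    if scheme == "repeat_last":
        # Copies of the last real flow: padded steps are indistinguishable
        # from genuine flows, so no step is masked out.
        filler = np.repeat(flows[-1:], n_pad, axis=0)
        mask = np.ones(length, dtype=bool)
    elif scheme == "zero_mask":
        # All-zero steps plus an explicit mask marking them as padding.
        filler = np.zeros((n_pad, d))
        mask = np.concatenate([np.ones(t, dtype=bool),
                               np.zeros(n_pad, dtype=bool)])
    else:
        raise ValueError(f"unknown padding scheme: {scheme!r}")
    return np.vstack([flows, filler]), mask


def group_split(groups, test_fraction=0.2, seed=0):
    """Split sample indices so that no group appears in both train and test.

    `groups` holds one group label per sample (here assumed to be a
    conversation ID). Returns (train_idx, test_idx).
    """
    groups = np.asarray(groups)
    unique = np.unique(groups)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique)  # shuffle the group labels, not the samples
    n_test = max(1, int(round(test_fraction * len(unique))))
    test_mask = np.isin(groups, unique[:n_test])
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)
```

In a sketch like this, the mask from `zero_mask` would typically be passed to the sequence model (for a Transformer, as a key-padding mask), whereas `repeat_last` gives the model no signal about which steps are padding. A random split corresponds to shuffling sample indices directly, which lets windows from the same conversation land on both sides of the split.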
