Analyzing distribution shift in data is a growing research direction in Machine Learning (ML), and it has led to new benchmarks designed to provide suitable scenarios for studying the generalization properties of ML models. Existing benchmarks focus on supervised learning; to the best of our knowledge, none targets unsupervised learning. We therefore introduce an unsupervised anomaly detection benchmark with data that shifts over time, built on Kyoto-2006+, a traffic dataset for network intrusion detection. This type of data meets the premise of a shifting input distribution: it covers a large time span ($10$ years), with naturally occurring changes over time (e.g., users modifying their behavior patterns, and software updates). We first highlight the non-stationary nature of the data using a basic per-feature analysis, t-SNE, and an Optimal Transport approach for measuring the overall distribution distances between years. Next, we propose AnoShift, a protocol that splits the data into IID, NEAR, and FAR testing splits. We validate the performance degradation over time with diverse models, ranging from classical approaches to deep learning. Finally, we show that by acknowledging the distribution shift problem and properly addressing it, performance can be improved compared to classical training, which assumes independent and identically distributed data (on average, by up to $3\%$ for our approach). The dataset and code are available at https://github.com/bit-ml/AnoShift/.
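To make the splitting protocol concrete, the following is a minimal sketch of a chronological split into TRAIN, IID, NEAR, and FAR subsets. It is not the released AnoShift code. The year boundaries, the `split_records` helper, and the `iid_every` hold-out rule are illustrative assumptions, since the abstract does not state them; records are assumed to be `(year, features)` pairs.

```python
from collections import defaultdict

# Assumed year boundaries, chosen only for illustration over a 10-year span.
TRAIN_YEARS = range(2006, 2011)  # period used for training and IID testing
NEAR_YEARS = range(2011, 2014)   # years shortly after the training period
FAR_YEARS = range(2014, 2016)    # years furthest from the training period


def split_records(records, iid_every=10):
    """Split (year, features) records into TRAIN, IID, NEAR, and FAR lists.

    Within the training period, every `iid_every`-th record of each year is
    held out as IID test data, so the IID split follows the same distribution
    as TRAIN. Later years are never used for training.
    """
    splits = {"TRAIN": [], "IID": [], "NEAR": [], "FAR": []}
    seen_per_year = defaultdict(int)  # counts records seen so far, per year
    for year, features in records:
        if year in TRAIN_YEARS:
            k = seen_per_year[year]
            seen_per_year[year] += 1
            name = "IID" if k % iid_every == 0 else "TRAIN"
        elif year in NEAR_YEARS:
            name = "NEAR"
        elif year in FAR_YEARS:
            name = "FAR"
        else:
            raise ValueError(f"year {year} is outside the assumed time span")
        splits[name].append((year, features))
    return splits
```

A model would be fit on `TRAIN` only and then scored separately on `IID`, `NEAR`, and `FAR`; comparing the three scores is what exposes degradation as the test data moves further in time from the training period.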