Detecting changes is of fundamental importance when analyzing data streams and has many applications, e.g., in predictive maintenance, fraud detection, and medicine. A principled approach to detecting changes is to compare the distributions of observations within the stream to each other via hypothesis testing. Maximum mean discrepancy (MMD; also called energy distance) is a well-known (semi-)metric on the space of probability distributions. Under mild conditions, MMD gives rise to powerful non-parametric two-sample tests on kernel-enriched domains, which makes it attractive for change detection. However, the classic MMD estimators have quadratic complexity, which prohibits their application in the online change detection setting. We propose a general-purpose change detection algorithm, Maximum Mean Discrepancy on Exponential Windows (MMDEW), which leverages the MMD two-sample test, facilitates its efficient online computation on any kernel-enriched domain, and is able to detect any disparity between distributions. Our experiments and analysis show that (1) MMDEW achieves better detection quality than state-of-the-art competitors and that (2) the algorithm has polylogarithmic runtime and logarithmic memory requirements, which allow its deployment in the streaming setting.
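To make the quadratic cost of the classic estimators concrete, the following is a minimal sketch of the biased (V-statistic) estimate of squared MMD between two samples. It is not the MMDEW algorithm. The Gaussian kernel, the bandwidth of 1.0, the scalar observations, and all function and variable names are assumptions chosen for illustration only. The double loop over all pairs of observations is what makes the cost quadratic in the sample size.

```python
import math
import random


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel between two scalar observations (assumed kernel choice)."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth=1.0):
    """Average kernel value over all pairs (a, b) with a in xs and b in ys."""
    total = sum(gaussian_kernel(a, b, bandwidth) for a in xs for b in ys)
    return total / (len(xs) * len(ys))


def mmd2_biased(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD; needs O((m + n)^2) kernel evaluations."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )


# Hypothetical stream segments: one without a change, one with a mean shift.
rng = random.Random(0)
before = [rng.gauss(0.0, 1.0) for _ in range(200)]
after_same = [rng.gauss(0.0, 1.0) for _ in range(200)]
after_shift = [rng.gauss(2.0, 1.0) for _ in range(200)]

mmd_same = mmd2_biased(before, after_same)
mmd_shift = mmd2_biased(before, after_shift)
```

In this sketch the estimate for the shifted segment should be clearly larger than the estimate for the unshifted one. A two-sample test would compare such a statistic against a threshold, and recomputing it from scratch for every new observation in a stream is what becomes prohibitively expensive.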