Detecting abrupt changes in real-time data streams from scientific simulations is a challenging task that demands accurate and efficient algorithms. Identifying change points in a live data stream involves continuously monitoring incoming observations for deviations in their statistical characteristics, which is particularly demanding in high-volume settings. Balancing rapid detection of sudden changes against a low false-alarm rate is vital. Many existing algorithms for this purpose assume known probability distributions, which limits their applicability. In this study, we introduce the Kernel-based Cumulative Sum (KCUSUM) algorithm, a non-parametric extension of the traditional Cumulative Sum (CUSUM) method, which has gained prominence for its efficacy in online change point detection under less restrictive conditions. KCUSUM distinguishes itself by comparing incoming samples directly with reference samples and computing a statistic grounded in the non-parametric Maximum Mean Discrepancy (MMD) framework. This approach extends KCUSUM's applicability to scenarios where only reference samples are available, such as atomic trajectories of proteins in vacuum, and makes it possible to detect deviations from the reference sample without prior knowledge of the data's underlying distribution. Furthermore, by harnessing the random-walk structure inherent in MMD, we can theoretically analyze KCUSUM's performance across various use cases, including metrics such as the expected detection delay and the mean run time to false alarm. Finally, we discuss real-world use cases from scientific simulations, such as NWChem CODAR and protein folding data, demonstrating KCUSUM's practical effectiveness in online change point detection.
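
The detection loop described above can be illustrated with a minimal sketch. The abstract does not specify the kernel, the pairing scheme, or the update rule, so everything below is an assumption made for illustration: a Gaussian kernel, consumption of the stream two observations at a time, a one-pair unbiased estimate of squared MMD as the increment, and a CUSUM-style recursion with a hypothetical `drift` offset and `threshold`. Drawing the two reference points from a fixed pool is likewise a simplification. The names `gaussian_kernel`, `mmd_increment`, and `kcusum` are hypothetical.

```python
import math
import random


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length sequences of floats."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_increment(x1, x2, y1, y2, kernel=gaussian_kernel):
    """One-pair unbiased estimate of squared MMD.

    x1, x2 are reference samples; y1, y2 are stream observations.
    The value has mean zero when both come from the same distribution.
    """
    return kernel(x1, x2) + kernel(y1, y2) - kernel(x1, y2) - kernel(x2, y1)


def kcusum(stream, reference, threshold, drift, rng=None, kernel=gaussian_kernel):
    """Return the 0-based index of the observation that triggers an alarm, or None.

    Assumed update rule: stat <- max(0, stat + increment - drift),
    evaluated once for every two incoming observations.
    """
    rng = rng or random.Random(0)
    stat = 0.0
    pending = None  # first observation of the current pair
    for i, y in enumerate(stream):
        if pending is None:
            pending = y
            continue
        # Draw two distinct reference points from the pool (a simplification).
        x1, x2 = rng.sample(reference, 2)
        stat = max(0.0, stat + mmd_increment(x1, x2, pending, y, kernel) - drift)
        pending = None
        if stat >= threshold:
            return i
    return None
```

In this sketch the `drift` term keeps the statistic near zero while the stream matches the reference, and `threshold` trades detection delay against false alarms: raising it lengthens both the expected delay and the mean run time to false alarm.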