We introduce a powerful scan statistic and a corresponding test for detecting the presence, and pinpointing the location, of a change point in the distribution of a data sequence whose elements reside in a separable metric space $(Ω, d)$. Such change points mark abrupt shifts in the distribution of the data sequence, as characterized by distance profiles, where the distance profile of an element $ω\in Ω$ is the distribution of distances from $ω$ as dictated by the data. The approach is tuning-parameter-free, fully nonparametric, and applicable to diverse data types, including distributional and network data, as long as distances between the data objects are available. We obtain an explicit characterization of the asymptotic distribution of the test statistic under the null hypothesis of no change points, rigorous guarantees on the consistency of the test in the presence of change points under fixed and local alternatives, and near-optimal convergence of the estimated change point location, all under practicable settings. To compare with state-of-the-art methods, we conduct simulations covering multivariate data, bivariate distributional data, and sequences of graph Laplacians, and we illustrate our method on real data sequences of U.S. electricity generation compositions and Bluetooth proximity networks.
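To make the notion of a distance profile concrete, the following is a minimal sketch of an empirical version: for a fixed element $ω$, it returns the empirical distribution function of the distances from $ω$ to the observed data. The function name `empirical_distance_profile`, the toy data, and the choice to include every sample point (rather than, say, excluding $ω$ itself when it belongs to the sample) are assumptions made for illustration. This is not the scan statistic or the test described above.

```python
from bisect import bisect_right
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def empirical_distance_profile(
    center: T,
    sample: Sequence[T],
    dist: Callable[[T, T], float],
) -> Callable[[float], float]:
    """Return t -> fraction of sample points within distance t of `center`.

    Hypothetical helper: an empirical CDF of the distances d(center, x)
    over all x in `sample` (assumption: no point is excluded).
    """
    if not sample:
        raise ValueError("sample must be non-empty")
    # Sort once so each evaluation of the profile is a binary search.
    distances = sorted(dist(center, x) for x in sample)
    n = len(distances)

    def profile(t: float) -> float:
        # Number of distances <= t, divided by the sample size.
        return bisect_right(distances, t) / n

    return profile


def abs_dist(a: float, b: float) -> float:
    """Metric on the real line, used only for the toy example."""
    return abs(a - b)


# Toy data on the real line; any metric space works given a `dist` function.
data = [0.0, 1.0, 2.0, 4.0]
profile_at_one = empirical_distance_profile(1.0, data, abs_dist)
```

Only the pairwise distances enter the computation, which is why the same construction carries over to distributional or network data once a metric such as a Wasserstein or Frobenius distance is supplied as `dist`.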