We introduce a powerful new scan statistic and an associated test for detecting the presence and pinpointing the location of a change point in the distribution of a data sequence whose elements take values in a general separable metric space $(\Omega, d)$. Such change points mark abrupt shifts in the distribution of the data sequence. Our method hinges on distance profiles, where the distance profile of an element $\omega \in \Omega$ is the distribution of distances from $\omega$ as dictated by the data. The approach is fully non-parametric and applies to diverse data types, including distributional and network data, as long as distances between the data objects are available. From a practical point of view, it is nearly free of tuning parameters: the only required choice is the specification of cut-off intervals near the endpoints, where change points are assumed not to occur. Our theoretical results include a precise characterization of the asymptotic distribution of the test statistic under the null hypothesis of no change points, rigorous guarantees on the consistency of the test in the presence of change points under contiguous alternatives, and consistency of the estimated change point location. Through comprehensive simulation studies encompassing multivariate data, bivariate distributional data and sequences of graph Laplacians, we demonstrate the effectiveness of our approach in terms of both detection power and accuracy of the estimated change point location. We apply our method to real datasets, including U.S. electricity generation compositions and Bluetooth proximity networks, underscoring its practical relevance.
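To make the notion of a distance profile concrete, the following is a minimal sketch in Python, assuming only that a pairwise distance matrix `D` is available. The function names `indicator_tensor`, `distance_profiles` and `profile_scan`, the evaluation grid, the `cut` parameter and the weighting `t * (n - t) / n` are illustrative assumptions. In particular, the scan shown here is a simplified contrast between before-split and after-split profiles, not the test statistic defined in the paper, and it comes with no calibrated critical value.

```python
import numpy as np


def indicator_tensor(D, grid):
    """ind[i, j, k] = 1 if d(x_i, x_j) <= grid[k], with self-pairs zeroed out."""
    ind = (D[:, :, None] <= grid[None, None, :]).astype(float)
    idx = np.arange(D.shape[0])
    ind[idx, idx, :] = 0.0  # a point does not contribute to its own profile
    return ind


def distance_profiles(D, grid):
    """Empirical distance profiles: row i is the ECDF, evaluated on `grid`,
    of the distances from observation i to the other n - 1 observations."""
    n = D.shape[0]
    return indicator_tensor(D, grid).sum(axis=1) / (n - 1)


def profile_scan(D, grid, cut=0.1):
    """Illustrative scan (an assumption, not the paper's statistic).

    For each candidate split t (the first segment holds observations 0..t-1),
    compare every point's profile relative to the first segment with its
    profile relative to the second segment. Splits within a fraction `cut`
    of either endpoint are skipped, mirroring the cut-off intervals.
    """
    n = D.shape[0]
    idx = np.arange(n)
    ind = indicator_tensor(D, grid)
    lo = max(2, int(np.ceil(cut * n)))
    scores = {}
    for t in range(lo, n - lo + 1):
        # Segment sizes, excluding the point itself from its own segment.
        n_left = t - (idx < t)
        n_right = (n - t) - (idx >= t)
        F_left = ind[:, :t, :].sum(axis=1) / n_left[:, None]
        F_right = ind[:, t:, :].sum(axis=1) / n_right[:, None]
        # Mean squared profile discrepancy, weighted by segment balance.
        scores[t] = (t * (n - t) / n) * np.mean((F_left - F_right) ** 2)
    t_hat = max(scores, key=scores.get)
    return t_hat, scores


# Usage on synthetic Euclidean data with a mean shift after observation 30.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 2)),
               rng.normal(4.0, 1.0, size=(30, 2))])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
grid = np.linspace(0.0, D.max(), 50)
profiles = distance_profiles(D, grid)
t_hat, scores = profile_scan(D, grid, cut=0.1)
print("estimated split:", t_hat)
```

Because the sketch uses only `D`, the same code would apply to any data type for which a metric is available, for example a Wasserstein distance between distributions or a Frobenius distance between graph Laplacians; only the construction of `D` changes.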