Kernel change-point detection (KCPD) has become a widely used tool for identifying structural changes in complex data. While existing theory establishes consistency under independence assumptions, real-world sequential data such as text exhibit strong dependencies. We establish new guarantees for KCPD under $m$-dependent data: specifically, we prove consistency in the number of detected change points and weak consistency in their locations under mild additional assumptions. To validate the asymptotics, we run an LLM-based simulation that generates synthetic $m$-dependent text. To complement these results, we present the first comprehensive empirical study of KCPD for text segmentation with modern embeddings. Across diverse text datasets, KCPD with text embeddings outperforms baselines on standard text segmentation metrics. Finally, a case study on Taylor Swift's tweets demonstrates that KCPD offers not only strong theoretical and simulated reliability but also practical effectiveness for text segmentation tasks.
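For readers unfamiliar with KCPD, the following is a minimal sketch of one common penalized formulation, not necessarily the estimator analyzed in the paper. It assumes an RBF kernel with a median-heuristic bandwidth, a linear penalty per change point, and an exact dynamic program over segment costs. The names `rbf_gram`, `kcpd`, `penalty`, and `min_size` are hypothetical. In the text setting each row of `X` would be a sentence embedding; here synthetic Gaussian vectors stand in for embeddings.

```python
import numpy as np

def rbf_gram(X):
    """RBF Gram matrix; bandwidth from the median heuristic (an assumption)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    iu = np.triu_indices(len(X), k=1)          # off-diagonal pairs only
    gamma = 1.0 / max(np.median(d2[iu]), 1e-12)
    return np.exp(-gamma * d2)

def kcpd(X, penalty, min_size=2):
    """Penalized kernel change-point detection via dynamic programming.

    Minimizes the sum of within-segment kernel scatter plus
    `penalty` times the number of change points.
    Returns the sorted list of change-point indices (segment starts).
    """
    n = len(X)
    K = rbf_gram(X)
    # 2D prefix sums so any block sum of K costs O(1).
    S = np.zeros((n + 1, n + 1))
    S[1:, 1:] = K.cumsum(axis=0).cumsum(axis=1)
    diag = np.concatenate([[0.0], np.cumsum(np.diag(K))])

    def cost(a, b):
        # Kernel scatter of the segment X[a:b].
        block = S[b, b] - S[a, b] - S[b, a] + S[a, a]
        return (diag[b] - diag[a]) - block / (b - a)

    best = np.full(n + 1, np.inf)
    best[0] = -penalty                          # first segment is not penalized
    prev = np.zeros(n + 1, dtype=int)
    for t in range(min_size, n + 1):
        for s in range(0, t - min_size + 1):
            if np.isfinite(best[s]):
                c = best[s] + cost(s, t) + penalty
                if c < best[t]:
                    best[t], prev[t] = c, s

    # Backtrack from the end to recover segment boundaries.
    cps, t = [], n
    while t > 0:
        t = prev[t]
        if t > 0:
            cps.append(int(t))
    return sorted(cps)

# Synthetic stand-in for embeddings: a mean shift at index 40.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 5)),
               rng.normal(3.0, 1.0, size=(40, 5))])
change_points = kcpd(X, penalty=5.0)
print(change_points)
```

The dynamic program is quadratic in the sequence length, which is fine for a sketch; the penalty value here is picked by hand for the toy data and would need calibration in practice.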