Unsupervised text segmentation is crucial because boundary labels are expensive, subjective, and often fail to transfer across domains and granularity choices. We propose Embed-KCPD, a training-free method that represents sentences as embedding vectors and estimates boundaries by minimizing a penalized kernel change-point detection (KCPD) objective. Beyond the algorithmic instantiation, we develop, to our knowledge, the first dependence-aware theory for KCPD under $m$-dependent sequences, a finite-memory abstraction of the short-range dependence common in language. We prove an oracle inequality for the population penalized risk and a localization guarantee showing that each true change point is recovered within a window that is small relative to segment length. To connect theory to practice, we introduce an LLM-based simulation framework that generates synthetic documents with controlled finite-memory dependence and known boundaries, validating the predicted scaling behavior. Across standard segmentation benchmarks, Embed-KCPD often outperforms strong unsupervised baselines. A case study on Taylor Swift's tweets illustrates that Embed-KCPD combines strong theoretical guarantees, simulated reliability, and practical effectiveness for text segmentation.
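To make the penalized objective concrete, here is a minimal pure-NumPy sketch of penalized kernel change-point detection over sentence embeddings, using a linear kernel and an O(n²) dynamic program. This is an illustrative toy, not the paper's Embed-KCPD implementation: the function name, the penalty value, and the synthetic embeddings are assumptions for demonstration.

```python
import numpy as np

def kcpd_segment(X, pen):
    """Penalized change-point detection with a linear kernel.

    X:   (n, d) array of sentence embeddings.
    pen: penalty paid per additional segment.
    Returns the sorted list of segment end indices (the last entry is n).
    """
    n, d = X.shape
    # Prefix sums give O(1) segment cost. Under the linear kernel the
    # within-segment cost is the squared deviation from the segment mean:
    #   cost(a, b) = sum_{i=a}^{b-1} ||x_i||^2 - ||sum_{i=a}^{b-1} x_i||^2 / (b - a)
    S = np.vstack([np.zeros(d), np.cumsum(X, axis=0)])
    Q = np.concatenate([[0.0], np.cumsum((X ** 2).sum(axis=1))])

    def cost(a, b):
        s = S[b] - S[a]
        return (Q[b] - Q[a]) - s @ s / (b - a)

    # F[t] = best penalized cost of segmenting X[:t]; prev[t] backtracks.
    F = np.full(n + 1, np.inf)
    F[0] = -pen
    prev = np.zeros(n + 1, dtype=int)
    for t in range(1, n + 1):
        for s in range(t):
            val = F[s] + cost(s, t) + pen
            if val < F[t]:
                F[t], prev[t] = val, s

    bps, t = [], n
    while t > 0:
        bps.append(t)
        t = prev[t]
    return sorted(int(b) for b in bps)

# Demo on synthetic embeddings with one clear mean shift at index 10.
X = np.vstack([np.zeros((10, 3)), np.full((10, 3), 5.0)])
print(kcpd_segment(X, pen=1.0))  # -> [10, 20]
```

Larger penalties merge adjacent segments; the paper's theory concerns how the penalty and the kernel interact with $m$-dependence, which this independent-data toy does not model.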