Linear text segmentation is a long-standing problem in natural language processing (NLP), focused on dividing continuous text into coherent and semantically meaningful units. Despite its importance, the task remains challenging due to the complexity of defining topic boundaries, the variability in discourse structure, and the need to balance local coherence with global context. These difficulties hinder downstream applications such as summarization, information retrieval, and question answering. In this work, we introduce SegNSP, framing linear text segmentation as a next sentence prediction (NSP) task. Although NSP has largely been abandoned in modern pre-training, its explicit modeling of sentence-to-sentence continuity makes it a natural fit for detecting topic boundaries. We propose a label-agnostic NSP approach, which predicts whether the next sentence continues the current topic without requiring explicit topic labels, and enhance it with a segmentation-aware loss combined with harder negative sampling to better capture discourse continuity. Unlike recent proposals that leverage NSP alongside auxiliary topic classification, our approach avoids task-specific supervision. We evaluate our model against established baselines on two datasets: CitiLink-Minutes, for which we establish the first segmentation benchmark, and WikiSection. On CitiLink-Minutes, SegNSP achieves a B-$F_1$ of 0.79, closely aligning with human-annotated topic transitions, while on WikiSection it attains a B-$F_1$ of 0.65, outperforming the strongest reproducible baseline, TopSeg, by 0.17 absolute points. These results demonstrate competitive and robust performance, highlighting the effectiveness of modeling sentence-to-sentence continuity for improving segmentation quality and supporting downstream NLP applications.
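As a rough illustration of the label-agnostic NSP framing described above (a sketch, not the paper's implementation), a pairwise continuation scorer can drive greedy boundary placement: a boundary is placed wherever the scorer judges the next sentence unlikely to continue the current topic. The names `continue_prob` and `toy_scorer` and the 0.5 threshold are hypothetical stand-ins for a fine-tuned NSP head and a tuned decision threshold.

```python
from typing import Callable, List

def segment(sentences: List[str],
            continue_prob: Callable[[str, str], float],
            threshold: float = 0.5) -> List[List[str]]:
    """Greedy linear segmentation: open a new segment whenever the
    NSP-style scorer rates the next sentence below the threshold,
    i.e. unlikely to continue the current topic."""
    segments: List[List[str]] = [[sentences[0]]]
    for prev, nxt in zip(sentences, sentences[1:]):
        if continue_prob(prev, nxt) < threshold:
            segments.append([nxt])      # predicted topic boundary
        else:
            segments[-1].append(nxt)    # same topic continues
    return segments

def toy_scorer(a: str, b: str) -> float:
    """Stand-in for an NSP head: Jaccard word overlap as a crude
    proxy for topical continuity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

segs = segment(["the cat sat", "the cat slept", "stocks fell sharply"],
               toy_scorer)
```

In this toy run, the lexical break between the second and third sentences falls below the threshold, so the output is two segments; a real NSP scorer would replace the overlap heuristic with a learned sentence-pair probability.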