Segmenting text into sentences plays an early and crucial role in many NLP systems. This is commonly done with rule-based or statistical methods that rely on lexical features such as punctuation. Although some recent works no longer rely exclusively on punctuation, we find that no prior method achieves all of (i) robustness to missing punctuation, (ii) effective adaptability to new domains, and (iii) high efficiency. We introduce a new model, Segment any Text (SaT), to solve this problem. To enhance robustness, we propose a new pretraining scheme that reduces reliance on punctuation. To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning, establishing state-of-the-art performance in distinct domains such as verses from lyrics and legal documents. Along the way, we introduce architectural modifications that yield a threefold gain in speed over the previous state of the art and remove the spurious reliance on context far in the future. Finally, we introduce a variant of our model fine-tuned on a diverse, multilingual mixture of sentence-segmented data, which acts as a drop-in replacement and enhancement for existing segmentation tools. Overall, our contributions provide a universal approach for segmenting any text. Our method outperforms all baselines, including strong LLMs, across 8 corpora spanning diverse domains and languages, especially in practically relevant situations where text is poorly formatted. Our models and code, including documentation, are available at https://huggingface.co/segment-any-text under the MIT license.
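To make the robustness problem concrete, the following is a minimal sketch of the kind of punctuation-based rule the abstract describes as common practice. It is not the SaT model or any baseline from the paper; the function name `naive_split` and the example strings are assumptions introduced purely for illustration.

```python
import re

# Hypothetical rule-based baseline for illustration only; this is NOT SaT.
# A boundary is any run of whitespace that directly follows '.', '!' or '?'.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def naive_split(text: str) -> list[str]:
    """Split text into sentences using sentence-final punctuation alone."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]


# Well-formatted input: the rule finds the boundary.
clean = "We arrived late. The show had started."
# Same content without punctuation or casing: the rule finds nothing to split on.
noisy = "we arrived late the show had started"

print(naive_split(clean))
print(naive_split(noisy))
```

On the well-punctuated string the rule returns two sentences, while the unpunctuated string comes back as a single unsegmented span. This is the failure mode, on poorly formatted text, that the abstract says a punctuation-dependent method cannot avoid.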