Text semantic segmentation partitions a document into semantically coherent paragraphs based on its subject matter, contextual information, and structure. Traditional approaches typically split documents into segments during preprocessing to satisfy input length constraints, which loses critical semantic information that spans segments. To address this, we present CrossFormer, a transformer-based model with a novel cross-segment fusion module that dynamically models latent semantic dependencies across document segments, substantially improving segmentation accuracy. CrossFormer can also replace rule-based chunking methods in Retrieval-Augmented Generation (RAG) systems, producing more semantically coherent chunks that improve retrieval and generation quality. Comprehensive evaluations confirm that CrossFormer achieves state-of-the-art performance on public text semantic segmentation datasets and yields considerable gains on RAG benchmarks.
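To make the contrast with rule-based chunking concrete, the following is a minimal sketch, not CrossFormer itself. It assumes a segmentation model emits one boolean per sentence marking whether that sentence ends a paragraph. The function names `fixed_size_chunks` and `boundary_chunks`, the example sentences, and the hand-written `predicted` labels are all hypothetical and used for illustration only.

```python
from typing import List, Sequence


def fixed_size_chunks(sentences: Sequence[str], size: int) -> List[List[str]]:
    """Rule-based baseline: cut every `size` sentences, ignoring topic shifts."""
    if size < 1:
        raise ValueError("size must be >= 1")
    return [list(sentences[i:i + size]) for i in range(0, len(sentences), size)]


def boundary_chunks(
    sentences: Sequence[str], is_boundary: Sequence[bool]
) -> List[List[str]]:
    """Group sentences into chunks, closing a chunk after each predicted boundary.

    is_boundary[i] is True when sentence i is predicted to end a paragraph.
    """
    if len(sentences) != len(is_boundary):
        raise ValueError("sentences and is_boundary must have the same length")
    chunks: List[List[str]] = []
    current: List[str] = []
    for sentence, ends_here in zip(sentences, is_boundary):
        current.append(sentence)
        if ends_here:
            chunks.append(current)
            current = []
    if current:  # flush trailing sentences that have no closing boundary
        chunks.append(current)
    return chunks


# Hypothetical document: two topics, three sentences each.
sentences = [
    "Transformers use self-attention.",
    "Attention cost grows with sequence length.",
    "Long inputs are therefore split into segments.",
    "Retrieval systems index text chunks.",
    "Chunk quality affects what the retriever returns.",
    "Coherent chunks keep related facts together.",
]

# Hand-written labels standing in for model predictions (an assumption).
predicted = [False, False, True, False, False, True]

rule_based = fixed_size_chunks(sentences, size=4)
semantic = boundary_chunks(sentences, predicted)
```

With these inputs, the fixed-size baseline places the first sentence of the second topic in the first chunk, while the boundary-based grouping keeps each topic in its own chunk. Real boundary labels would come from a trained segmentation model rather than a hand-written list.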