Topic segmentation is critical for obtaining structured long documents and improving downstream tasks such as information retrieval. Owing to their ability to automatically explore clues of topic shift from large amounts of labeled data, recent supervised neural models have greatly advanced long document topic segmentation, but they have left the deeper relationship between semantic coherence and topic segmentation underexplored. This paper therefore enhances the supervised model's ability to capture coherence from both structure and similarity perspectives, by introducing Topic-aware Sentence Structure Prediction (TSSP) and Contrastive Semantic Similarity Learning (CSSL), to further improve topic segmentation performance. Specifically, the TSSP task forces the model to comprehend structural information by learning the original relations of adjacent sentences in a disarrayed document, which is constructed by jointly disrupting the original document at the topic and sentence levels. In addition, we use inter- and intra-topic information to construct contrastive samples and design the CSSL objective to ensure that sentence representations within the same topic have higher semantic similarity, while those in different topics are less similar. Extensive experiments show that Longformer with our approach significantly outperforms the previous state-of-the-art (SOTA) methods. Our approach improves the $F_{1}$ of the previous SOTA by 3.42 points (73.74 -> 77.16) and reduces $P_{k}$ by 1.11 points (15.0 -> 13.89) on WIKI-727K, and achieves an average $P_{k}$ reduction of 0.83 points on WikiSection. An average $P_{k}$ drop of 2.82 points on the two out-of-domain datasets also demonstrates the robustness of our approach.
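As a rough illustration of the two ideas summarized in the abstract, the following is a minimal sketch and not the paper's implementation. It makes several assumptions that the abstract does not specify: the function names (`disarray`, `pair_labels`, `contrastive_loss`), the three-way labeling of adjacent sentence pairs, the cosine similarity, the temperature `tau`, and the exact form of the contrastive loss are all hypothetical choices made for illustration.

```python
import math
import random


def disarray(topics, seed=0):
    """Disrupt a document at the topic level and at the sentence level.

    `topics` is a list of topics, each a non-empty list of sentences.
    Returns a flat list of (topic_id, original_sentence_index, sentence).
    """
    rng = random.Random(seed)
    order = list(range(len(topics)))
    rng.shuffle(order)  # topic-level disruption
    flat = []
    for t in order:
        idx = list(range(len(topics[t])))
        rng.shuffle(idx)  # sentence-level disruption within the topic
        flat.extend((t, i, topics[t][i]) for i in idx)
    return flat


def pair_labels(flat):
    """Hypothetical 3-way label for each adjacent pair in the disarrayed text.

    0 = the two sentences come from different topics
    1 = same topic, and the second directly followed the first originally
    2 = same topic, any other original relation
    """
    labels = []
    for (t1, i1, _), (t2, i2, _) in zip(flat, flat[1:]):
        if t1 != t2:
            labels.append(0)
        elif i2 == i1 + 1:
            labels.append(1)
        else:
            labels.append(2)
    return labels


def cosine(u, v, eps=1e-12):
    """Cosine similarity of two equal-length vectors (eps avoids 0-division)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def contrastive_loss(embs, topic_ids, tau=0.1):
    """Assumed contrastive objective over sentence embeddings.

    For each anchor, same-topic sentences are positives and all other
    sentences are negatives. Anchors without any positive are skipped.
    """
    losses = []
    for a in range(len(embs)):
        pos, total = 0.0, 0.0
        for b in range(len(embs)):
            if a == b:
                continue
            w = math.exp(cosine(embs[a], embs[b]) / tau)
            total += w
            if topic_ids[a] == topic_ids[b]:
                pos += w
        if pos > 0.0:
            losses.append(-math.log(pos / total))
    return sum(losses) / len(losses) if losses else 0.0
```

In this sketch, `pair_labels(disarray(topics))` would supply classification targets for a structure-prediction head, and `contrastive_loss` becomes smaller as same-topic sentence embeddings move closer together relative to embeddings from other topics. In a real training setup the embeddings would come from the encoder (Longformer in the paper) and the loss would be computed with a differentiable tensor library rather than plain Python.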