Topic segmentation is critical for obtaining structured long documents and for improving downstream tasks such as information retrieval. Because they can automatically learn cues of topic shift from large amounts of labeled data, recent supervised neural models have greatly advanced long-document topic segmentation, but they leave the deeper relationship between semantic coherence and topic segmentation underexplored. This paper therefore enhances a supervised model's ability to capture coherence from both the structure and the similarity perspective, by introducing Topic-aware Sentence Structure Prediction (TSSP) and Contrastive Semantic Similarity Learning (CSSL), to further improve topic segmentation performance. Specifically, the TSSP task forces the model to comprehend structural information by learning the original relations between adjacent sentences in a disarrayed document, which is constructed by jointly disrupting the original document at the topic and sentence levels. In addition, we use inter- and intra-topic information to construct contrastive samples and design the CSSL objective so that sentence representations within the same topic have higher semantic similarity, while those from different topics are less similar. Extensive experiments show that Longformer with our approach significantly outperforms previous state-of-the-art (SOTA) methods. On WIKI-727K, our approach improves the $F_{1}$ of the previous SOTA by 3.42 points (73.74 -> 77.16) and reduces $P_{k}$ by 1.11 points (15.0 -> 13.89); on WikiSection, it reduces $P_{k}$ by 0.83 points on average. An average $P_{k}$ drop of 2.82 points on two out-of-domain datasets further illustrates the robustness of our approach.
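Since the results above are reported in $P_{k}$ (lower is better), a minimal sketch of how this metric is commonly computed may help in reading the numbers. This is an illustration and not the evaluation code used in the paper. The function names `to_segment_ids` and `pk`, the label convention (1 marks the last sentence of a segment), and the default window size (half the average reference segment length, rounded down) are assumptions made for this example.

```python
from typing import List, Optional


def to_segment_ids(boundaries: List[int]) -> List[int]:
    """Convert per-sentence boundary labels into per-sentence segment ids.

    Assumed convention: boundaries[i] == 1 means sentence i is the last
    sentence of its segment.
    """
    ids = []
    current = 0
    for b in boundaries:
        ids.append(current)
        if b == 1:
            current += 1
    return ids


def pk(reference: List[int], hypothesis: List[int], k: Optional[int] = None) -> float:
    """P_k between two segmentations given as per-sentence segment ids.

    A window of width k slides over the document. A window counts as an
    error when the reference and the hypothesis disagree on whether its
    two end sentences belong to the same segment.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    n = len(reference)
    if k is None:
        # Assumption: k is half the average reference segment length, floored.
        num_segments = 1 + sum(
            reference[i] != reference[i - 1] for i in range(1, n)
        )
        k = max(1, n // (2 * num_segments))
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of sentences")

    errors = 0
    for i in range(n - k):
        same_in_ref = reference[i] == reference[i + k]
        same_in_hyp = hypothesis[i] == hypothesis[i + k]
        if same_in_ref != same_in_hyp:
            errors += 1
    return errors / (n - k)


if __name__ == "__main__":
    ref = to_segment_ids([0, 0, 1, 0, 0, 1])  # two segments of three sentences
    hyp = to_segment_ids([0, 1, 0, 0, 0, 1])  # boundary predicted one sentence early
    print(pk(ref, hyp, k=2))  # 0.5
```

The function returns a fraction in [0, 1]; the figures quoted above (for example 15.0 -> 13.89) correspond to this value expressed as a percentage.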