Improving Long Document Topic Segmentation Models With Enhanced Coherence Modeling

Topic segmentation is critical for obtaining structured documents and improving downstream tasks such as information retrieval. Because they can automatically learn cues of topic shift from abundant labeled data, recent supervised neural models have greatly advanced long document topic segmentation, but they leave the deeper relationship between coherence and topic segmentation underexplored. This paper therefore enhances the ability of supervised models to capture coherence from both the logical structure and the semantic similarity perspectives in order to further improve topic segmentation performance, proposing two auxiliary tasks: Topic-aware Sentence Structure Prediction (TSSP) and Contrastive Semantic Similarity Learning (CSSL). Specifically, the TSSP task forces the model to comprehend structural information by learning the original relations between adjacent sentences in a disarrayed document, which is constructed by jointly disrupting the original document at the topic and sentence levels. Moreover, we use inter- and intra-topic information to construct contrastive samples and design the CSSL objective so that sentence representations within the same topic have higher similarity, while those from different topics are less similar. Extensive experiments show that Longformer with our approach significantly outperforms previous state-of-the-art (SOTA) methods. Our approach improves the $F_1$ of the previous SOTA by 3.42 points (73.74 -> 77.16) and reduces $P_k$ by 1.11 points (15.0 -> 13.89) on WIKI-727K, and achieves an average relative reduction of 4.3% in $P_k$ on WikiSection. An average relative $P_k$ drop of 8.38% on two out-of-domain datasets also demonstrates the robustness of our approach.
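To make the TSSP input construction more concrete, the following is a minimal sketch of how a disarrayed document and its adjacent-sentence relation labels might be built. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_disarrayed_document`, the fully random shuffling at both levels, and the three-way label scheme (0, 1, 2) are hypothetical choices made here for clarity, since the abstract does not specify them.

```python
import random


def build_disarrayed_document(topics, seed=0):
    """Shuffle a document at topic and sentence level (hypothetical sketch).

    topics: list of topics, each a list of sentences in original order.
    Returns (sentences, labels), where labels[i] describes the relation
    between sentences[i] and sentences[i + 1] in the ORIGINAL document.
    """
    rng = random.Random(seed)

    # Remember where each sentence came from: (topic index, position, text).
    tagged = [
        [(t, i, sent) for i, sent in enumerate(sents)]
        for t, sents in enumerate(topics)
    ]

    # Topic-level disruption: permute the order of whole topics.
    rng.shuffle(tagged)

    # Sentence-level disruption: permute sentences inside each topic.
    for block in tagged:
        rng.shuffle(block)

    flat = [item for block in tagged for item in block]

    # Assumed label scheme (not taken from the paper):
    #   0 = the two sentences come from different topics
    #   1 = same topic, but not consecutive in the original order
    #   2 = same topic and consecutive in the original order
    labels = []
    for (t1, i1, _), (t2, i2, _) in zip(flat, flat[1:]):
        if t1 != t2:
            labels.append(0)
        elif i2 == i1 + 1:
            labels.append(2)
        else:
            labels.append(1)

    return [sent for _, _, sent in flat], labels


if __name__ == "__main__":
    doc = [["a1", "a2", "a3"], ["b1", "b2"]]
    sentences, labels = build_disarrayed_document(doc, seed=0)
    print(sentences)
    print(labels)
```

In this sketch the sentences of a topic stay contiguous after shuffling, so a label of 0 still marks a topic boundary, while labels 1 and 2 ask a model to judge whether the original sentence order survived. A classifier trained on such labels would need to reason about structure rather than surface adjacency, which is the intuition the abstract gives for TSSP.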
