Dialogue Topic Segmentation (DTS) plays an essential role in a variety of dialogue modeling tasks. Previous unsupervised DTS methods assess topic similarity through either semantic similarity or dialogue coherence. However, topic similarity cannot be fully captured by semantic similarity or dialogue coherence. In addition, unlabeled dialogue data, which contains useful clues about utterance relationships, remains underexploited. In this paper, we propose a novel unsupervised DTS framework that learns topic-aware utterance representations from unlabeled dialogue data through neighboring utterance matching and pseudo-segmentation. Extensive experiments on two benchmark datasets (i.e., DialSeg711 and Doc2Dial) demonstrate that our method significantly outperforms strong baseline methods. For reproducibility, we provide our code and data at: https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/dial-start.