Learning high-quality sentence embeddings from dialogues has drawn increasing attention, as it is essential for solving a variety of dialogue-oriented tasks at low annotation cost. However, directly annotating and gathering utterance relationships in conversations is difficult, while token-level annotations, e.g., entities, slots, and templates, are much easier to obtain. General sentence embedding methods are usually sentence-level self-supervised frameworks and cannot utilize this extra token-level knowledge. In this paper, we introduce Template-aware Dialogue Sentence Embedding (TaDSE), a novel augmentation method that utilizes template information to effectively learn utterance representations via a self-supervised contrastive learning framework. TaDSE augments each sentence with its corresponding template and then conducts pairwise contrastive learning over both sentence and template. We further enhance the effect with a synthetically augmented dataset that strengthens the utterance-template relation, in which entity detection (slot-filling) is a preliminary step. We evaluate TaDSE on five downstream benchmark datasets. The experimental results show that TaDSE achieves significant improvements over previous SOTA methods, along with a consistent margin of improvement on the Intent Classification task. We further introduce Semantic Compression, a novel analytic instrument, for which we discover a correlation with uniformity and alignment. Our code will be released soon.
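To make the sentence-template pairing concrete, the following is a minimal sketch and not the released TaDSE implementation. It assumes that a template is obtained by replacing annotated slot values with placeholders, and that the pairwise objective resembles a symmetric InfoNCE loss with in-batch negatives. The function names (`make_template`, `info_nce`, `pairwise_loss`), the temperature value, and the toy vectors are all hypothetical choices made for illustration.

```python
import math


def make_template(utterance, slots):
    """Replace each annotated slot value with a placeholder such as [city].

    Assumption: slot annotations are given as {slot_name: surface_value}.
    """
    template = utterance
    for slot_name, value in slots.items():
        template = template.replace(value, f"[{slot_name}]")
    return template


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss: anchors[i] should match positives[i];
    the other positives in the batch act as negatives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def pairwise_loss(sentence_emb, template_emb, temperature=0.05):
    """Symmetric loss over both directions (an assumption, not the paper's exact form)."""
    return 0.5 * (
        info_nce(sentence_emb, template_emb, temperature)
        + info_nce(template_emb, sentence_emb, temperature)
    )


# Template construction from hypothetical slot annotations.
print(make_template("book a flight to Paris on Monday",
                    {"city": "Paris", "day": "Monday"}))
# -> book a flight to [city] on [day]

# Toy embeddings standing in for encoder outputs.
sentences = [[1.0, 0.0], [0.0, 1.0]]
templates = [[0.9, 0.1], [0.1, 0.9]]
print(pairwise_loss(sentences, templates) < pairwise_loss(sentences, templates[::-1]))
```

In this sketch the loss is lower when each sentence embedding lies close to its own template embedding than when the pairs are mismatched, which is the behaviour a contrastive objective of this kind is meant to reward.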