Novel intent discovery automates the grouping of similar messages (questions) to identify previously unknown intents. However, current research focuses on publicly available datasets that contain only a question field and differ significantly from real-life datasets. This paper proposes methods to improve the intent discovery pipeline deployed on a large e-commerce platform. We show the benefit of pre-training language models on in-domain data, both self-supervised and with weak supervision. We also devise a method, which we call Conv, that was the most effective in our experiments at utilizing the conversational structure (i.e., question and answer) of real-life datasets during fine-tuning for clustering tasks. Combined, our methods fully utilize real-life datasets and give a performance boost of up to 33 percentage points (pp) over the state-of-the-art Constrained Deep Adaptive Clustering (CDAC) model trained on questions only. By comparison, the question-only CDAC model gives a boost of at most 13pp over the naive baseline.