Intent discovery is the task of inferring latent intents from a set of unlabeled utterances, and is a useful step towards the efficient creation of new conversational agents. We show that recent competitive methods in intent discovery can be outperformed by clustering utterances based on abstractive summaries, i.e., "labels", that retain the core elements while removing non-essential information. We contribute the IDAS approach, which collects a set of descriptive utterance labels by prompting a Large Language Model: starting from a well-chosen seed set of prototypical utterances, it bootstraps an In-Context Learning procedure that generates labels for the non-prototypical utterances. The utterances and their resulting noisy labels are then encoded by a frozen pre-trained encoder and subsequently clustered to recover the latent intents. For the unsupervised task (without any intent labels), IDAS outperforms the state of the art by up to 7.42% in standard cluster metrics on the Banking, StackOverflow, and Transport datasets. For the semi-supervised task (with labels for a subset of intents), IDAS surpasses two recent methods on the CLINC benchmark without even using labeled data.
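The encode-then-cluster stage described above can be illustrated with a minimal, dependency-free sketch. It is not the IDAS implementation: the bag-of-words `encode` function is a toy stand-in for a frozen pre-trained encoder, the labels are hard-coded where an LLM would generate them, averaging the utterance and label vectors in `combine` is an assumption about how the two are merged, and the small k-means routine is only one possible clustering choice. All function names and example utterances are hypothetical.

```python
import math

def tokenize(text):
    return text.lower().split()

def build_vocab(texts):
    # Map each distinct token to a vector index.
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(text, vocab):
    # Toy stand-in for a frozen pre-trained encoder:
    # an L2-normalised bag-of-words vector.
    vec = [0.0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def combine(utt_vec, label_vec):
    # Assumption: merge utterance and label by averaging their vectors.
    return [(u + l) / 2.0 for u, l in zip(utt_vec, label_vec)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10):
    # Deterministic farthest-first initialisation, then standard k-means updates.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    assignments = [0] * len(points)
    for _ in range(iters):
        assignments = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments

# Hypothetical utterances and hard-coded labels (stand-ins for LLM output).
utterances = [
    "i lost my card yesterday can you help",
    "my card is missing since this morning",
    "what is the exchange rate for euros",
    "tell me today's rate to exchange dollars",
]
labels = [
    "report lost card",
    "report lost card",
    "check exchange rate",
    "check exchange rate",
]

vocab = build_vocab(utterances + labels)
points = [combine(encode(u, vocab), encode(l, vocab)) for u, l in zip(utterances, labels)]
clusters = kmeans(points, k=2)
print(clusters)
```

In this toy setup the shared label pulls utterances with the same intent together even when their wording overlaps very little, which is the intuition behind clustering on summaries rather than on raw utterances alone.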