New intent discovery is valuable in natural language processing because it helps systems understand user needs and provide better services. However, most existing methods struggle to capture the complicated semantics of discrete text representations when prior knowledge from labeled data is limited or absent. To tackle this problem, we propose USNID, a novel clustering framework for unsupervised and semi-supervised new intent discovery, built on three key techniques. First, it fully utilizes unsupervised or semi-supervised data to mine shallow semantic similarity relations and provide well-initialized representations for clustering. Second, it uses a centroid-guided clustering mechanism to address inconsistent cluster allocation and provide high-quality self-supervised targets for representation learning. Third, it captures high-level semantics in unsupervised or semi-supervised data to discover fine-grained intent-wise clusters by optimizing both cluster-level and instance-level objectives. We also propose an effective method for estimating the number of clusters in open-world scenarios where the number of new intents is not known beforehand. USNID achieves new state-of-the-art results on several benchmark intent datasets in both unsupervised and semi-supervised new intent discovery, and remains robust across different cluster numbers.
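The abstract does not say how the centroid-guided mechanism is implemented, so the following is only a minimal sketch of one plausible reading: re-running k-means after each representation update while warm-starting it from the previous centroids, so that cluster indices keep the same meaning and can serve as stable pseudo-labels. The function names (`assign`, `kmeans`), the toy 2-D points standing in for sentence embeddings, and the simulated "representation update" are all assumptions made for illustration, not details taken from USNID.

```python
import math
import random


def assign(points, centroids):
    """Return the index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]


def kmeans(points, init_centroids, n_iter=20):
    """Plain Lloyd's k-means started from the given centroids (hypothetical helper)."""
    centroids = [list(c) for c in init_centroids]
    labels = assign(points, centroids)
    for _ in range(n_iter):
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
        new_labels = assign(points, centroids)
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids


def agreement(a, b):
    """Fraction of points that received the same cluster index in both runs."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
# Toy "embeddings": 10 points around each of three well-separated centers.
epoch1 = [(cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3))
          for cx, cy in centers for _ in range(10)]
# Simulated representation update: every embedding moves slightly.
epoch2 = [(x + rng.gauss(0, 0.1), y + rng.gauss(0, 0.1)) for x, y in epoch1]

# Epoch 1: no previous centroids exist, so seed with one point per group.
labels1, centroids1 = kmeans(epoch1, [epoch1[0], epoch1[10], epoch1[20]])

# Epoch 2, warm start: reuse the epoch-1 centroids as the initialization.
labels_warm, _ = kmeans(epoch2, centroids1)
# Epoch 2, fresh start with the seeds in a different order (assumed baseline).
labels_cold, _ = kmeans(epoch2, [epoch2[20], epoch2[0], epoch2[10]])

warm_agreement = agreement(labels1, labels_warm)
cold_agreement = agreement(labels1, labels_cold)
print("warm start agreement:", warm_agreement)
print("fresh start agreement:", cold_agreement)
```

In this toy setting both epoch-2 runs recover the same three groups, but only the warm-started run keeps the cluster indices aligned with epoch 1. The fresh start permutes them, which is the kind of allocation inconsistency that would make cluster indices unusable as training targets from one epoch to the next.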