New intent discovery is of great value to natural language processing because it enables a better understanding of user needs and more user-friendly services. However, most existing methods struggle to capture the complicated semantics of discrete text representations when little or no labeled data is available as prior knowledge. To tackle this problem, we propose USNID, a novel framework for unsupervised and semi-supervised new intent discovery built on three key techniques. First, it makes full use of unsupervised or semi-supervised data to mine shallow semantic similarity relations and provide well-initialized representations for clustering. Second, it introduces a centroid-guided clustering mechanism that addresses the inconsistency of cluster allocations and provides high-quality self-supervised targets for representation learning. Third, it captures high-level semantics in unsupervised or semi-supervised data and discovers fine-grained intent-wise clusters by optimizing both cluster-level and instance-level objectives. We also propose an effective method for estimating the number of clusters in open-world scenarios, where the number of new intents is not known beforehand. USNID achieves new state-of-the-art results on several intent benchmark datasets in both unsupervised and semi-supervised new intent discovery, and its performance remains robust across different cluster numbers.
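The abstract does not spell out how the centroid-guided clustering mechanism works, so the following is only a minimal sketch of one plausible reading: each new round of k-means is initialized from the previous round's centroids, so that cluster index k keeps referring to the same group and the pseudo-labels used as self-supervised targets stay consistent. The toy 2-D features, the function names (`kmeans`, `assign`, `update`), and the noise levels are all illustrative assumptions, not part of USNID itself.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]

def update(points, labels, centroids):
    """Recompute each centroid as the mean of its members.

    An empty cluster keeps its old centroid.
    """
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new_centroids.append([sum(c) / len(members) for c in zip(*members)])
        else:
            new_centroids.append(list(old))
    return new_centroids

def kmeans(points, init_centroids, iters=20):
    """Lloyd's algorithm started from the given centroids."""
    centroids = [list(c) for c in init_centroids]
    labels = assign(points, centroids)
    for _ in range(iters):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Toy stand-in for sentence embeddings: three well-separated 2-D groups.
rng = random.Random(0)
group_centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
features_round1 = [
    [cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3)]
    for cx, cy in group_centers
    for _ in range(20)
]

# Round 1: for this toy demo, seed with one point from each group.
init = [features_round1[0], features_round1[20], features_round1[40]]
labels_round1, centroids_round1 = kmeans(features_round1, init)

# Simulate a small representation update between training rounds.
features_round2 = [
    [x + rng.gauss(0, 0.1), y + rng.gauss(0, 0.1)] for x, y in features_round1
]

# Warm start from the previous centroids: cluster indices stay aligned.
labels_warm, _ = kmeans(features_round2, centroids_round1)

# Start from the same centroids in a different order: the partition is the
# same, but the indices are permuted, so the targets would be inconsistent.
labels_cold, _ = kmeans(features_round2, list(reversed(centroids_round1)))

print("warm start keeps labels:", labels_warm == labels_round1)
print("reordered start keeps labels:", labels_cold == labels_round1)
```

In this sketch the partition found in both second-round runs is identical; only the numbering differs. That numbering is what matters when cluster assignments are reused as classification targets across rounds, which is the inconsistency the abstract refers to.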