Generalized Category Discovery (GCD) is a crucial task that aims to recognize both known and novel categories in a set of unlabeled data, using a small amount of labeled data that covers only the known categories. Because supervision and category information are lacking, current methods usually perform poorly on novel categories and struggle to reveal the semantic meaning of the discovered clusters, which limits their real-world applications. To mitigate these issues, we propose Loop, an end-to-end active-learning framework that introduces Large Language Models (LLMs) into the training loop, boosting model performance and generating category names without any human effort. Specifically, we first propose Local Inconsistent Sampling (LIS) to select samples that are more likely to have been assigned to the wrong clusters, based on neighborhood prediction consistency and the entropy of cluster assignment probabilities. We then propose a Scalable Query strategy that lets LLMs choose the true neighbors of the selected samples from multiple candidate samples. Based on the feedback from LLMs, we perform Refined Neighborhood Contrastive Learning (RNCL), which pulls samples and their neighbors closer together to learn clustering-friendly representations. Finally, we select representative samples from the clusters corresponding to novel categories and let LLMs generate category names for them. Extensive experiments on three benchmark datasets show that Loop outperforms state-of-the-art models by a large margin and generates accurate category names for the discovered clusters. Code and data are available at https://github.com/Lackel/LOOP.
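The selection idea behind Local Inconsistent Sampling can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it only assumes what the abstract states, namely that a sample is suspicious when its predicted cluster disagrees with those of its nearest neighbors and when its cluster assignment distribution has high entropy. The function name `local_inconsistent_sampling`, the parameters `k` and `n_select`, the use of cosine similarity, and the equal weighting of the two signals are all hypothetical choices made for illustration.

```python
import numpy as np


def local_inconsistent_sampling(embeddings, probs, k=5, n_select=10):
    """Rank samples by how likely they are to sit in the wrong cluster.

    embeddings: (n, d) array of sample representations.
    probs:      (n, c) array of cluster assignment probabilities.
    Returns the indices of the n_select most suspicious samples.
    """
    preds = probs.argmax(axis=1)

    # Pairwise cosine similarity; a sample is never its own neighbor.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)
    neighbors = np.argsort(-sim, axis=1)[:, :k]

    # Fraction of the k nearest neighbors predicted to be in a different cluster.
    inconsistency = (preds[neighbors] != preds[:, None]).mean(axis=1)

    # Entropy of the assignment distribution, scaled to [0, 1].
    eps = 1e-12
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)
    entropy = entropy / np.log(probs.shape[1])

    # Assumption: the two signals are simply added with equal weight.
    score = inconsistency + entropy
    return np.argsort(-score)[:n_select]
```

In a Loop-style pipeline, the returned indices would be the samples passed on to the LLM query step; how the two signals are actually combined in the paper may differ from this additive score.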