This paper investigates the recently emerged problem of Language-assisted Image Clustering (LaIC), where textual semantics are leveraged to improve the discriminability of visual representations and thereby facilitate image clustering. Because true class names are unavailable, one of the core challenges of LaIC lies in how to filter positive nouns, i.e., those semantically close to the images of interest, from unlabeled wild corpus data. Existing filtering strategies are predominantly based on the off-the-shelf feature space learned by CLIP; however, despite being intuitive, they lack a rigorous theoretical foundation. To fill this gap, we propose a novel gradient-based framework, termed GradNorm, which is theoretically guaranteed and shows strong empirical performance. In particular, we measure the positiveness of each noun by the magnitude of the gradients back-propagated from the cross-entropy between the predicted target distribution and the softmax output. Theoretically, we derive a rigorous error bound that quantifies the separability of positive nouns under GradNorm, and we prove that GradNorm naturally subsumes existing filtering strategies as special cases. Empirically, extensive experiments show that GradNorm achieves state-of-the-art clustering performance on various benchmarks.
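The gradient-based scoring idea can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: the uniform target distribution, the analytic form of the cross-entropy gradient with respect to the logits, and all function and variable names here are assumptions for the sake of a self-contained example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gradnorm_scores(image_embs, noun_embs, target=None):
    """Score each candidate noun by the magnitude of the cross-entropy
    gradient taken with respect to the image-to-noun logits.

    image_embs: (N, d) L2-normalized image features (e.g., from CLIP)
    noun_embs:  (K, d) L2-normalized noun text features
    target:     (K,) target distribution over nouns; uniform if None
    Returns a (K,) array of per-noun mean gradient magnitudes.
    """
    logits = image_embs @ noun_embs.T              # (N, K) cosine similarities
    probs = softmax(logits)                        # softmax output per image
    if target is None:
        K = noun_embs.shape[0]
        target = np.full(K, 1.0 / K)               # assumed uniform target
    # For cross-entropy H(target, softmax(z)), the gradient with respect
    # to the logits z is (softmax(z) - target); its magnitude is the score.
    grad = probs - target                          # (N, K)
    return np.abs(grad).mean(axis=0)               # per-noun mean |gradient|
```

In practice one would rank the candidate nouns by these scores and keep the top-scoring ones as positives; the choice of target distribution and of the norm over the gradient are design decisions left open by this sketch.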