Typical deep clustering methods, despite notable progress, produce only one clustering result per dataset. This limitation stems from their assumption of a fixed underlying data distribution, so the single result may not match what a user needs. Our work investigates how multi-modal large language models (MLLMs) can be leveraged for user-driven clustering, emphasizing their adaptability to user-specified semantic requirements. However, clustering directly on MLLM output risks producing unstructured, generic image descriptions rather than concrete, feature-specific ones. To address this issue, we first observe that MLLMs' hidden states of text tokens are strongly related to the corresponding features, and we use these embeddings to cluster under any user-defined criterion. We also employ a lightweight clustering head augmented with pseudo-label learning, which significantly improves clustering accuracy. Extensive experiments demonstrate competitive performance across diverse datasets and metrics.
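The last two steps of the abstract (clustering on embeddings, then refining with a lightweight head and pseudo-labels) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the embeddings are synthetic stand-ins for MLLM text-token hidden states, the initial pseudo-labels come from plain k-means, the head is a single linear softmax layer, and the names `kmeans_labels`, `train_head`, and `predict` and the confidence threshold are hypothetical.

```python
import numpy as np

def kmeans_labels(x, k, iters=20):
    """Lloyd's k-means with deterministic farthest-point seeding."""
    centroids = [x[0]]
    for _ in range(1, k):
        # squared distance of each point to its nearest already-chosen seed
        d = np.min([((x - c) ** 2).sum(1) for c in centroids], axis=0)
        centroids.append(x[d.argmax()])
    centroids = np.stack(centroids)
    for _ in range(iters):
        d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = x[labels == j].mean(0)
    return labels

def softmax(z):
    z = z - z.max(1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(1, keepdims=True)

def train_head(x, pseudo, k, rounds=3, epochs=200, lr=0.1, thresh=0.9):
    """Linear softmax head trained by gradient descent on pseudo-labels.

    After each round the head relabels the data itself and keeps only
    predictions whose confidence is at least `thresh` (an assumed scheme).
    """
    n, dim = x.shape
    w, b = np.zeros((dim, k)), np.zeros(k)
    mask = np.ones(n, dtype=bool)  # the first round trusts every k-means label
    for _ in range(rounds):
        onehot = np.eye(k)[pseudo]
        for _ in range(epochs):
            p = softmax(x @ w + b)
            # cross-entropy gradient, averaged over the trusted samples only
            g = (p - onehot) * mask[:, None] / max(mask.sum(), 1)
            w -= lr * (x.T @ g)
            b -= lr * g.sum(0)
        p = softmax(x @ w + b)
        pseudo, mask = p.argmax(1), p.max(1) >= thresh
    return w, b

def predict(x, w, b):
    return softmax(x @ w + b).argmax(1)

# Synthetic "embeddings": three well-separated groups in 3-D (assumed data).
rng = np.random.default_rng(0)
true = np.repeat(np.arange(3), 30)
x = 6.0 * np.eye(3)[true] + 0.5 * rng.normal(size=(90, 3))

pseudo = kmeans_labels(x, k=3)       # initial pseudo-labels
w, b = train_head(x, pseudo, k=3)    # lightweight clustering head
pred = predict(x, w, b)              # final cluster assignments
```

Under this reading, clustering by a different user criterion would change only the embeddings passed in as `x` (for example, hidden states obtained under a different prompt), while the head and the pseudo-label loop stay the same. Cluster indices are arbitrary, so any comparison with ground truth should be made up to a permutation of labels.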