Representation learning aims to extract meaningful lower-dimensional embeddings, known as representations, from data. Despite its widespread application, there is no established definition of a ``good'' representation. Typically, representation quality is evaluated by performance on downstream tasks such as clustering and de-noising. However, this task-specific approach has a limitation: a representation that performs well on one task may not be effective on another. This highlights the need for a task-agnostic formulation, which is the focus of our work. We propose a downstream-agnostic formulation: when inherent clusters exist in the data, the representations should be specific to each cluster. Building on this idea, we develop a meta-algorithm that jointly learns cluster-specific representations and cluster assignments. Because our approach is easy to integrate with any representation learning framework, we demonstrate its effectiveness in various setups, including Autoencoders, Variational Autoencoders, Contrastive learning models, and Restricted Boltzmann Machines. We qualitatively compare our cluster-specific embeddings with standard embeddings, and we also compare them on downstream tasks such as de-noising and clustering. While our method slightly increases runtime and parameter count compared to the standard model, the experiments show that it extracts the inherent cluster structures in the data, resulting in improved performance in relevant applications.
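
To make the idea of jointly learning cluster-specific representations and cluster assignments concrete, the following is a minimal sketch under strong simplifying assumptions, not the meta-algorithm of this work. It assumes hard cluster assignments and uses one linear encoder per cluster (per-cluster PCA) as a stand-in for the representation learner. The function name `fit_cluster_specific_pca` and all of its parameters are hypothetical.

```python
import numpy as np

def fit_cluster_specific_pca(X, k, dim, n_iter=20, seed=0):
    """Illustrative alternating scheme (an assumption, not the paper's method):
    (a) fit one linear encoder per cluster, then
    (b) reassign each point to the cluster that reconstructs it best."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    labels = rng.integers(k, size=n)  # random initial assignment
    for _ in range(n_iter):
        means, bases = [], []
        for c in range(k):
            Xc = X[labels == c]
            if len(Xc) <= dim:
                # Re-seed a (near-)empty cluster with a few random points.
                Xc = X[rng.choice(n, size=dim + 1, replace=False)]
            mu = Xc.mean(axis=0)
            # The top `dim` right singular vectors are the principal directions.
            _, _, Vt = np.linalg.svd(Xc - mu, full_matrices=False)
            means.append(mu)
            bases.append(Vt[:dim])
        # Reconstruction error of every point under every cluster's encoder.
        errors = np.empty((n, k))
        for c in range(k):
            Z = (X - means[c]) @ bases[c].T    # cluster-specific embedding
            recon = Z @ bases[c] + means[c]    # decode back to input space
            errors[:, c] = ((X - recon) ** 2).sum(axis=1)
        new_labels = errors.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels, means, bases
```

In this sketch the per-cluster PCA step could in principle be swapped for any other representation learner trained on the points currently assigned to that cluster, which is the sense in which such a scheme acts as a wrapper around an existing model. The hard `argmin` reassignment is a simplification chosen for brevity.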