Data clustering, the task of grouping observations according to their similarity, is a key component of unsupervised learning, with real-world applications in diverse fields such as biology, medicine, and social science. In these fields, the data often comes with complex interdependencies between the dimensions of analysis: for instance, the various characteristics and opinions people hold live on a complex social network. Current clustering methods are ill-suited to tackle this complexity: deep learning can approximate these dependencies, but it cannot take their explicit map as an input of the analysis. In this paper, we aim to fix this blind spot in the unsupervised learning literature. We create network-aware embeddings by estimating the network distance between numeric node attributes via the generalized Euclidean distance. Unlike all methods in the literature that we know of, we do not cluster the nodes of the network, but rather its node attributes. In our experiments, we show that these network-aware embeddings are always beneficial for the learning task; that our method scales to large networks; and that it provides actionable insights in applications in a variety of fields such as marketing, economics, and political science. Our method is fully open source, and data and code are available to reproduce all results in the paper.
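To make the notion of a network distance between node attributes concrete, here is a minimal sketch. The abstract does not state the formula, so the code assumes one common formulation of the generalized Euclidean distance, in which the difference between two node-attribute vectors is weighted by the pseudoinverse of the graph Laplacian. The function name `generalized_euclidean` and the toy path graph are illustrative choices, not taken from the paper's released code.

```python
import numpy as np


def generalized_euclidean(adjacency, a, b):
    """Network distance between two node-attribute vectors a and b.

    Assumed formulation: sqrt((a - b)^T L^+ (a - b)), where L^+ is the
    Moore-Penrose pseudoinverse of the graph Laplacian L = D - A.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    laplacian_pinv = np.linalg.pinv(laplacian)
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    # Clamp tiny negative values caused by floating-point error.
    return float(np.sqrt(max(diff @ laplacian_pinv @ diff, 0.0)))


# Toy example: an undirected path graph 0 - 1 - 2 - 3.
A = np.array(
    [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ],
    dtype=float,
)

# Each attribute is concentrated on a single node.
attr_on_0 = [1, 0, 0, 0]
attr_on_1 = [0, 1, 0, 0]
attr_on_3 = [0, 0, 0, 1]

near = generalized_euclidean(A, attr_on_0, attr_on_1)
far = generalized_euclidean(A, attr_on_0, attr_on_3)
print(near, far)
```

In this toy graph, the attribute concentrated on node 0 is closer to the one on the adjacent node 1 than to the one on node 3 at the far end of the path. Pairwise distances of this kind, computed between attributes rather than between nodes, could then be passed to any standard clustering routine.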