Real-world datasets are often complex and, to some degree, hierarchical: groups and subgroups of data share common characteristics at different levels of abstraction. Uncovering the hidden structure of such datasets is an important task with many practical applications. To address this challenge, we present a new and general method for building relational data trees by exploiting the learning dynamics of the Restricted Boltzmann Machine (RBM). Our method is based on a mean-field approach that is derived from the Plefka expansion and was developed in the context of disordered systems. It is designed to be easily interpretable. We tested our method on an artificially created hierarchical dataset and on three real-world datasets (images of digits, mutations in the human genome, and a homologous family of proteins). The method automatically identifies the hierarchical structure of the data. This could be useful in the study of homologous protein sequences, where the relationships between proteins are critical for understanding their function and evolution.