In this study, we address the challenge of using energy-based models to generate high-quality, label-specific data for complex structured datasets, such as population genetics, RNA, or protein sequence data. Traditional training methods struggle because Markov chain Monte Carlo mixing is inefficient, which limits the diversity of the synthetic data and increases generation times. To overcome these issues, we use a novel training algorithm that exploits non-equilibrium effects. Applied to the Restricted Boltzmann Machine, this approach improves the model's ability to classify samples correctly and to generate high-quality synthetic data in only a few sampling steps. We demonstrate its effectiveness on four types of data: handwritten digits, mutations of human genomes classified by continental origin, functionally characterized sequences of an enzyme protein family, and homologous RNA sequences from specific taxonomies.
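To make "label-specific generation in only a few sampling steps" concrete, the following is a minimal sketch of label-clamped Gibbs sampling in a binary Restricted Boltzmann Machine. It is an illustration under stated assumptions, not the training algorithm described above: the function name `sample_with_label`, the label-to-hidden coupling matrix `D`, the binary units, and the random untrained parameters are all hypothetical choices made for this example.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sample_with_label(W, D, a, b, label, k, rng):
    """Run k alternating Gibbs steps with the label held fixed (clamped).

    Assumed (hypothetical) parameterization:
      W[i][j]: coupling between visible unit i and hidden unit j
      D[l][j]: coupling between label l and hidden unit j
      a[i], b[j]: visible and hidden biases
    """
    n_visible, n_hidden = len(a), len(b)
    # The chain starts from a uniformly random binary configuration.
    v = [rng.randint(0, 1) for _ in range(n_visible)]
    for _ in range(k):
        # Sample hidden units given the visible units and the clamped label.
        h = []
        for j in range(n_hidden):
            field = b[j] + D[label][j] + sum(v[i] * W[i][j] for i in range(n_visible))
            h.append(1 if rng.random() < sigmoid(field) else 0)
        # Sample visible units given the hidden units.
        v = []
        for i in range(n_visible):
            field = a[i] + sum(W[i][j] * h[j] for j in range(n_hidden))
            v.append(1 if rng.random() < sigmoid(field) else 0)
    return v


# Toy usage with random, untrained parameters (illustrative only).
rng = random.Random(0)
n_visible, n_hidden, n_labels = 6, 4, 2
W = [[rng.gauss(0, 0.1) for _ in range(n_hidden)] for _ in range(n_visible)]
D = [[rng.gauss(0, 0.1) for _ in range(n_hidden)] for _ in range(n_labels)]
a = [0.0] * n_visible
b = [0.0] * n_hidden
sample = sample_with_label(W, D, a, b, label=1, k=10, rng=rng)
```

In this sketch the number of steps `k` is simply a small fixed constant. With a trained model, the quality of the samples obtained after so few steps would depend on how the model was trained, which is the issue the abstract's training algorithm targets.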