In this study, we address the challenge of using energy-based models to produce high-quality, label-specific data in complex structured datasets, such as population genetics data and RNA or protein sequence data. Traditional training methods struggle because Markov chain Monte Carlo mixing is inefficient, which reduces the diversity of the synthetic data and increases generation times. To overcome these issues, we use a novel training algorithm that exploits non-equilibrium effects. Applied to the Restricted Boltzmann Machine, this approach improves the model's ability to correctly classify samples and to generate high-quality synthetic data in only a few sampling steps. We demonstrate the effectiveness of the method on four different types of data: handwritten digits, mutations of human genomes classified by continental origin, functionally characterized sequences of an enzyme protein family, and homologous RNA sequences from specific taxonomies.