Learning a categorical distribution comes with its own set of challenges. A successful approach taken by state-of-the-art work is to cast the problem in a continuous domain, taking advantage of the strong performance of generative models for continuous data. Among these are the recently emerging diffusion probabilistic models, which have been observed to generate high-quality samples. Recent advances in categorical generative models have focused on improving log-likelihood. In this work, we propose a generative model for categorical data that is based on diffusion models and focuses on high-quality sample generation, and we propose sample-based evaluation methods. The efficacy of our method stems from performing diffusion in the continuous domain while informing its parameterization with the categorical structure of the target distribution. Our evaluation, which includes experiments on synthetic and real-world protein datasets, highlights the capabilities and limitations of different generative models for generating categorical data.