Diffusion models can generate striking images with uncommon juxtapositions, such as astronauts riding horses on the moon with properly placed shadows. Such outputs suggest an ability to perform compositional generalization, but how do the models achieve it? We perform controlled experiments on conditional DDPMs trained to generate 2D spherical Gaussian bumps centered at specified $x$- and $y$-positions. Our results show that the emergence of semantically meaningful latent representations is key to achieving high performance. Over the course of learning, the model traverses three distinct phases of latent representation: (phase A) no latent structure, (phase B) a 2D manifold of disordered states, and (phase C) a 2D ordered manifold. Each phase corresponds to a qualitatively different generation behavior: 1) multiple bumps are generated, 2) one bump is generated but at inaccurate $x$ and $y$ locations, 3) one bump is generated at the correct $x$ and $y$ location. Furthermore, we show that even with imbalanced datasets, in which the features ($x$- versus $y$-positions) are represented with skewed frequencies, the learning of $x$ and $y$ is coupled rather than factorized. This demonstrates that simple vanilla diffusion models cannot learn efficient representations in which localization in $x$ and $y$ is factorized into separate 1D tasks. These findings suggest the need for future work on inductive biases that push generative models to discover and exploit factorizable, independent structure in their inputs, which will be required to move these models into more data-efficient regimes.
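For concreteness, the kind of training data described in the abstract (images of a single 2D Gaussian bump, labeled by its center) might be generated along the following lines. This is only a minimal sketch: the image size, the bump width `sigma`, the position grid, and the function names `gaussian_bump` and `make_dataset` are illustrative assumptions, not values or code taken from the paper.

```python
import numpy as np


def gaussian_bump(x0, y0, size=32, sigma=1.0):
    """Return a (size, size) image of an isotropic Gaussian bump centered at (x0, y0).

    `size` and `sigma` are assumed values chosen for illustration only.
    """
    # ys indexes rows (vertical axis), xs indexes columns (horizontal axis).
    ys, xs = np.mgrid[0:size, 0:size]
    return np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2.0 * sigma ** 2))


def make_dataset(x_positions, y_positions, size=32, sigma=1.0):
    """Build one image per (x, y) pair; the pair serves as the conditioning label."""
    labels = np.array(
        [(x, y) for x in x_positions for y in y_positions], dtype=np.float32
    )
    images = np.stack(
        [gaussian_bump(x, y, size, sigma) for x, y in labels]
    ).astype(np.float32)
    return images, labels


# Hypothetical balanced grid of centers; an imbalanced dataset could instead
# pass many x_positions and only a few y_positions (or vice versa).
images, labels = make_dataset(range(4, 28, 4), range(4, 28, 4))
```

Each row of `labels` is the $(x, y)$ center that a conditional model would receive as its conditioning signal, and the matching entry of `images` is the target it should learn to generate.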