Do Diffusion Models Learn Semantically Meaningful and Efficient Representations?

Diffusion models are capable of impressive feats of image generation with uncommon juxtapositions such as astronauts riding horses on the moon with properly placed shadows. These outputs indicate the ability to perform compositional generalization, but how do the models do so? We perform controlled experiments on conditional DDPMs learning to generate 2D spherical Gaussian bumps centered at specified $x$- and $y$-positions. Our results show that the emergence of semantically meaningful latent representations is key to achieving high performance. En route to successful performance over learning, the model traverses three distinct phases of latent representations: (phase A) no latent structure, (phase B) a 2D manifold of disordered states, and (phase C) a 2D ordered manifold. Corresponding to each of these phases, we identify qualitatively different generation behaviors: 1) multiple bumps are generated, 2) one bump is generated but at inaccurate $x$ and $y$ locations, 3) a bump is generated at the correct $x$ and $y$ location. Furthermore, we show that even under imbalanced datasets where features ($x$- versus $y$-positions) are represented with skewed frequencies, the learning process for $x$ and $y$ is coupled rather than factorized, demonstrating that simple vanilla-flavored diffusion models cannot learn efficient representations in which localization in $x$ and $y$ is factorized into separate 1D tasks. These findings suggest the need for future work to find inductive biases that will push generative models to discover and exploit factorizable independent structures in their inputs, which will be required to vault these models into more data-efficient regimes.
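
To make the task concrete, the following is a minimal sketch of how such a dataset of 2D Gaussian bumps with $(x, y)$ conditioning labels might be constructed. The image size, the bump width `sigma`, the integer grid of centers, and the function names `gaussian_bump` and `make_dataset` are all illustrative assumptions and are not taken from the paper. The last lines check the standard fact that an isotropic 2D Gaussian is the outer product of two 1D Gaussians, which is the factorizable structure the abstract refers to.

```python
import numpy as np

def gaussian_bump(x0, y0, size=32, sigma=1.0):
    """Return a size x size image holding an isotropic Gaussian centered at (x0, y0).

    size and sigma are illustrative choices, not values from the paper.
    """
    ys, xs = np.mgrid[0:size, 0:size]  # ys indexes rows, xs indexes columns
    return np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2.0 * sigma ** 2))

def make_dataset(size=32, sigma=1.0):
    """Build one image per integer (x, y) center, plus the (x, y) conditioning labels."""
    centers = [(x, y) for x in range(size) for y in range(size)]
    images = np.stack([gaussian_bump(x, y, size, sigma) for x, y in centers])
    labels = np.array(centers, dtype=np.float32)
    return images, labels

images, labels = make_dataset(size=8, sigma=1.0)

# The 2D bump factorizes into a 1D Gaussian in x times a 1D Gaussian in y.
coords = np.arange(8)
gx = np.exp(-((coords - 3) ** 2) / 2.0)  # 1D profile along x, center 3, sigma 1
gy = np.exp(-((coords - 5) ** 2) / 2.0)  # 1D profile along y, center 5, sigma 1
factorized = np.outer(gy, gx)            # rows follow y, columns follow x
is_separable = np.allclose(factorized, gaussian_bump(3, 5, size=8, sigma=1.0))
```

A conditional DDPM would then be trained on `images` with `labels` as the conditioning input. The separability check shows why, in principle, localization in $x$ and in $y$ could be treated as two independent 1D problems.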

