Compositional generalization, the ability to generate novel combinations of known concepts, is a key ingredient for visual generative models, yet the mechanisms that enable or inhibit it remain only partially understood. In this work, we conduct a systematic study of how various design choices promote or hinder compositional generalization in image and video generation. Through controlled experiments, we identify two key factors: (i) whether the training objective operates on a discrete or a continuous distribution, and (ii) to what extent the conditioning provides information about the constituent concepts during training. Building on these insights, we show that relaxing the discrete loss of MaskGIT with an auxiliary continuous JEPA-based objective improves the compositional performance of such discrete models.
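The combined objective mentioned above can be sketched as a weighted sum of a discrete masked-token cross-entropy (as in MaskGIT) and a continuous feature-regression term (JEPA-style). This is a minimal illustrative sketch, not the paper's implementation: the function names, the weight `lambda_jepa`, the use of mean-squared error for the feature term, and all array shapes are assumptions made for clarity.

```python
import numpy as np

def masked_token_ce(logits, targets, mask):
    """Discrete MaskGIT-style loss: cross-entropy over masked token positions.

    logits:  (N, V) unnormalized scores for N token positions, vocabulary size V
    targets: (N,)   ground-truth token indices
    mask:    (N,)   boolean, True where the token was masked (loss applies there)
    """
    # Log-softmax computed with the max-subtraction trick for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll[mask].mean()

def jepa_feature_loss(pred_feats, target_feats):
    """Continuous JEPA-style auxiliary term: regress predicted features toward
    target-encoder features (a plain mean-squared error is assumed here)."""
    return ((pred_feats - target_feats) ** 2).mean()

def combined_loss(logits, targets, mask, pred_feats, target_feats, lambda_jepa=0.5):
    """Relax the discrete objective with a weighted continuous auxiliary term."""
    return (masked_token_ce(logits, targets, mask)
            + lambda_jepa * jepa_feature_loss(pred_feats, target_feats))
```

In this sketch the discrete term trains the token predictor as usual, while the continuous term provides gradients in a smooth feature space; `lambda_jepa` trades off the two and would be tuned in practice.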