Decomposing complex data into factorized representations can reveal reusable components and enable synthesizing new samples via component recombination. We investigate this in the context of diffusion-based models that learn factorized latent spaces without factor-level supervision. In images, factors can capture background, illumination, and object attributes; in robotic videos, they can capture reusable motion components. To improve both latent factor discovery and the quality of compositional generation, we introduce an adversarial training signal via a discriminator trained to distinguish single-source samples from those generated by recombining factors across sources. By optimizing the generator to fool this discriminator, we encourage physical and semantic consistency in the resulting recombinations. Our method outperforms implementations of prior baselines on CelebA-HQ, Virtual KITTI, CLEVR, and Falcor3D, achieving lower FID scores and better disentanglement as measured by MIG and MCC. Furthermore, we demonstrate a novel application to robotic video trajectories: by recombining learned action components, we generate diverse sequences that significantly increase state-space coverage for exploration on the LIBERO benchmark.
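The cross-source recombination and adversarial objective described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the slot-structured latent layout, the non-saturating GAN loss form, and all function names are assumptions introduced here for exposition.

```python
import numpy as np

def recombine(z_a, z_b, swap_idx):
    """Swap the chosen factor slots of source A's latent with source B's.
    z_a, z_b: latent codes shaped (n_factors, dim_per_factor)."""
    z_new = z_a.copy()
    z_new[swap_idx] = z_b[swap_idx]
    return z_new

def discriminator_loss(d_single, d_recombined):
    """Non-saturating GAN loss (an assumed choice): the discriminator is
    trained to score single-source samples as real and samples generated
    from cross-source recombined latents as fake."""
    eps = 1e-8  # numerical stability for log
    return -np.mean(np.log(d_single + eps)
                    + np.log(1.0 - d_recombined + eps))

def generator_loss(d_recombined):
    """The generator is optimized to fool the discriminator, pushing
    recombined samples toward physical and semantic consistency."""
    eps = 1e-8
    return -np.mean(np.log(d_recombined + eps))

# Toy usage: swap factor slot 1 (e.g. illumination) between two sources.
z_a = np.zeros((3, 4))          # three factor slots, 4 dims each
z_b = np.ones((3, 4))
z_mix = recombine(z_a, z_b, swap_idx=[1])
```

In this sketch, a perfect discriminator (scores near 1 on single-source samples and near 0 on recombinations) drives `discriminator_loss` toward zero, while `generator_loss` falls as recombined samples become indistinguishable from single-source ones.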