Continuous-time generative models have achieved remarkable success in image restoration and synthesis. However, controlling the composition of multiple pre-trained models remains an open challenge. Current approaches largely treat composition as an algebraic operation on probability densities, for example via products or mixtures of experts. This perspective assumes the target distribution is known explicitly, which is almost never the case. In this work, we propose a different paradigm that formulates compositional generation as a cooperative Stochastic Optimal Control problem. Rather than combining probability densities, we treat pre-trained diffusion models as interacting agents whose diffusion trajectories are jointly steered, via optimal control, toward a shared objective defined on their aggregated output. We validate our framework on conditional MNIST generation and compare it against a naive inference-time DPS-style baseline that replaces learned cooperative control with per-step gradient guidance.
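To make the baseline concrete, the following is a minimal toy sketch of DPS-style per-step gradient guidance, not the paper's learned cooperative control. It assumes (illustratively, not from the paper) a one-dimensional variance-preserving diffusion whose data distribution is a standard Gaussian, so the exact score of every noisy marginal is s(x) = -x, and a hypothetical quadratic objective whose gradient is added to the score at each reverse step.

```python
import numpy as np

def reverse_sample(n_steps=1000, guidance_scale=2.0, y_target=3.0, seed=0):
    """Sample the guided reverse SDE with Euler-Maruyama steps.

    Toy assumptions (for illustration only): data ~ N(0, 1), so the
    marginal score is -x at every noise level; the guidance term is the
    negative gradient of the quadratic loss 0.5 * (x - y_target)**2.
    """
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    x = rng.standard_normal()  # start from the prior N(0, 1)
    for _ in range(n_steps):
        score = -x                                   # exact score for N(0, 1)
        guidance = -guidance_scale * (x - y_target)  # DPS-style per-step term
        # Reverse-time VP-SDE drift (beta = 1): 0.5*x + score, plus guidance
        drift = 0.5 * x + score + guidance
        x = x + drift * dt + np.sqrt(dt) * rng.standard_normal()
    return x

samples = np.array([reverse_sample(seed=s) for s in range(50)])
print(samples.mean())  # pulled toward y_target, away from the prior mean 0
```

The point of the sketch is the structure of the baseline: guidance is injected greedily at every denoising step, with no learned control and no coordination across models, which is exactly what the cooperative formulation is meant to improve on.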