Diffusion models excel at generation, but their latent spaces are high-dimensional and not explicitly organized for interpretation or control. We introduce ConDA (Contrastive Diffusion Alignment), a plug-and-play geometry layer that applies contrastive learning to pretrained diffusion latents using auxiliary variables (e.g., time, stimulation parameters, facial action units). ConDA learns a low-dimensional embedding whose directions align with underlying dynamical factors, consistent with recent contrastive-learning results on structured and disentangled representations. In this embedding, simple nonlinear trajectories support smooth interpolation, extrapolation, and counterfactual editing, while rendering remains in the original diffusion space. ConDA thus separates editing from rendering: embedding trajectories are lifted back to diffusion latents with a neighborhood-preserving kNN decoder, and the procedure is robust across inversion solvers. Across fluid dynamics, neural calcium imaging, therapeutic neurostimulation, facial expression dynamics, and monkey motor cortex activity, ConDA yields more interpretable and controllable latent structure than linear traversals and conditioning-based baselines, indicating that diffusion latents encode dynamics-relevant structure that an explicit contrastive geometry layer can exploit.
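To make the two components named above concrete, here is a minimal sketch of (i) a contrastive geometry layer trained on pretrained diffusion latents with auxiliary-variable-defined positives, and (ii) a neighborhood-preserving kNN lift back to diffusion-latent space. The MLP encoder, the InfoNCE form of the loss, the nearest-auxiliary-value positive-sampling rule, and all hyperparameters are illustrative assumptions, not the paper's specification.

```python
# Hypothetical sketch of a ConDA-style pipeline. Shapes: z is (N, D)
# pretrained diffusion latents; aux is (N,) auxiliary values (e.g., time).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometryLayer(nn.Module):
    """Small MLP mapping high-dim diffusion latents to a low-dim embedding."""
    def __init__(self, d_latent, d_embed=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_latent, 256), nn.GELU(),
            nn.Linear(256, d_embed),
        )

    def forward(self, z):
        return F.normalize(self.net(z), dim=-1)  # unit-norm embedding

def info_nce(anchor, positive, temperature=0.1):
    """InfoNCE with in-batch negatives: row i's positive is column i."""
    logits = anchor @ positive.T / temperature
    labels = torch.arange(len(anchor), device=anchor.device)
    return F.cross_entropy(logits, labels)

def train_step(model, opt, z, aux, batch=256):
    """One contrastive step; each anchor's positive is the in-batch sample
    whose auxiliary value is closest to its own (an assumed sampling rule)."""
    idx = torch.randperm(len(z))[:batch]
    za, ya = z[idx], aux[idx]
    gap = (ya[:, None] - ya[None, :]).abs()
    gap.fill_diagonal_(float("inf"))      # never pair an anchor with itself
    zp = za[gap.argmin(dim=1)]            # nearest-auxiliary-value positive
    loss = info_nce(model(za), model(zp))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def knn_lift(e_query, e_train, z_train, k=10):
    """Lift edited embedding points back to diffusion-latent space by
    distance-weighted averaging of the k nearest training neighbors."""
    dist, idx = torch.cdist(e_query, e_train).topk(k, largest=False)
    w = torch.softmax(-dist, dim=-1)      # closer neighbors weigh more
    return torch.einsum("qk,qkd->qd", w, z_train[idx])

# Toy usage with random stand-ins for latents and an auxiliary variable:
z = torch.randn(2048, 512)
aux = torch.rand(2048)
model = GeometryLayer(d_latent=512)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    train_step(model, opt, z, aux)
e_train = model(z).detach()
# Edit a trajectory in the embedding, then lift it back for diffusion rendering:
z_lifted = knn_lift(e_train[:16], e_train, z, k=10)
```

The separation the abstract describes falls out of this structure: edits happen on the low-dimensional embedding, and `knn_lift` returns points in the original diffusion-latent space, where any standard sampler can render them.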