Diffusion transformers have demonstrated remarkable generative capabilities, yet the rich perceptual representations computed across their denoising trajectory are discarded once the content is rendered. We present MMDiff, a framework that transforms a frozen diffusion transformer into a multi-modal generative system that jointly produces images alongside any combination of dense perceptual modalities using lightweight decoder heads. Our central finding is that perceptual information is temporally distributed along the denoising trajectory, and that multi-timestep feature fusion with spatially varying aggregation weights is essential, improving semantic segmentation results by up to 28.7% mIoU over single-timestep extraction. We further adopt concept-driven attention extraction for interpretable spatial guidance, and show that frozen diffusion features are competitive with and complementary to state-of-the-art encoders such as DINOv3. By training only lightweight decoder heads on a frozen backbone, we achieve strong performance in semantic segmentation, salient object detection, and depth estimation, and demonstrate that this framework enables effective synthetic data generation at scale.
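The multi-timestep fusion idea described above can be illustrated with a small NumPy sketch. It is not the MMDiff implementation: the function name `fuse_timestep_features`, the tensor layout `(T, H, W, C)`, and the choice of a per-pixel softmax over timesteps are all assumptions made for illustration. In a real system the logits would presumably be predicted by a trained head, whereas here they are random placeholders.

```python
import numpy as np


def fuse_timestep_features(features, logits):
    """Fuse per-timestep feature maps with spatially varying weights.

    features: array of shape (T, H, W, C), one feature map per denoising timestep
              (hypothetical layout, assumed for this sketch).
    logits:   array of shape (T, H, W), unnormalized per-pixel scores per timestep.
    Returns (fused, weights) with shapes (H, W, C) and (T, H, W).
    """
    if features.shape[:3] != logits.shape:
        raise ValueError("logits must match the (T, H, W) dims of features")

    # Numerically stable softmax over the timestep axis, computed per pixel,
    # so each spatial location gets its own mixture over timesteps.
    shifted = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)

    # Broadcast weights over the channel axis and sum over timesteps.
    fused = (weights[..., None] * features).sum(axis=0)
    return fused, weights


# Toy usage with random stand-ins for backbone features and learned logits.
rng = np.random.default_rng(0)
T, H, W, C = 4, 8, 8, 16
features = rng.normal(size=(T, H, W, C))
logits = rng.normal(size=(T, H, W))
fused, weights = fuse_timestep_features(features, logits)
```

With all-zero logits the weights become uniform and the fusion reduces to a plain average over timesteps. Single-timestep extraction corresponds to the limit where one timestep's logit dominates at every pixel.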