Extensive pre-training on large-scale data is indispensable for downstream geometric and semantic visual perception tasks. Thanks to large-scale text-to-image (T2I) pre-training, recent works show promising results by simply fine-tuning T2I diffusion models for dense perception tasks. However, several crucial design decisions in this process still lack comprehensive justification, including the necessity of the multi-step stochastic diffusion mechanism, the training strategy, the inference ensemble strategy, and the quality of fine-tuning data. In this work, we conduct a thorough investigation of the critical factors that affect transfer efficiency and performance when using diffusion priors. Our key findings are: 1) High-quality fine-tuning data is paramount for both semantic and geometric perception tasks. 2) The stochastic nature of diffusion models has a slightly negative impact on deterministic visual perception tasks. 3) Beyond fine-tuning the diffusion model with latent-space supervision alone, task-specific image-level supervision is beneficial for enhancing fine-grained details. These observations culminate in GenPercept, an effective deterministic one-step fine-tuning paradigm tailored for dense visual perception tasks. Unlike previous multi-step methods, our paradigm offers much faster inference and can be seamlessly integrated with customized perception decoders and loss functions for image-level supervision, which is critical for improving the fine-grained details of predictions. Comprehensive experiments on diverse dense visual perception tasks, including monocular depth estimation, surface normal estimation, image segmentation, and matting, demonstrate the remarkable adaptability and effectiveness of our proposed method.
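The contrast between multi-step stochastic sampling and the one-step deterministic paradigm can be illustrated with a toy sketch. This is a minimal illustration, not the paper's actual implementation: `denoiser` is a hypothetical stand-in for a pretrained diffusion network, and all names and shapes are assumptions. It shows why the one-step route is deterministic (repeated calls agree exactly) while the multi-step route inherits variance from the sampled noise.

```python
import numpy as np

def denoiser(x, rgb, t):
    """Hypothetical stand-in for a pretrained diffusion UNet.

    It pulls the current sample toward a target map derived from the
    input image (think depth or surface normals).
    """
    target = np.tanh(rgb)          # stand-in for the ground-truth map
    return x + 0.5 * (target - x)  # one refinement step

def multi_step_stochastic(rgb, steps=10, seed=None):
    """Conventional diffusion-style inference: start from sampled noise
    and iteratively refine. The result depends on the noise seed."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(rgb.shape)  # random initialization
    for t in range(steps):
        x = denoiser(x, rgb, t)
    return x

def one_step_deterministic(rgb):
    """One-step paradigm: treat the pretrained model as a deterministic
    predictor, with a fixed input and a single forward pass."""
    x = np.zeros_like(rgb)              # fixed input, no sampled noise
    return denoiser(x, rgb, t=0)

rgb = np.linspace(-1.0, 1.0, 8).reshape(2, 4)

# Deterministic: two runs produce identical predictions.
a = one_step_deterministic(rgb)
b = one_step_deterministic(rgb)
assert np.allclose(a, b)

# Stochastic: different noise seeds leave residual differences.
s1 = multi_step_stochastic(rgb, seed=0)
s2 = multi_step_stochastic(rgb, seed=1)
assert not np.allclose(s1, s2)
```

Beyond determinism, the one-step formulation also exposes an image-level output directly, so task-specific decoders and losses (e.g. a matting or segmentation head) can supervise fine-grained details instead of operating only in latent space.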