Diffusion models have emerged as state-of-the-art generative methods for image synthesis, yet their potential as general-purpose feature encoders remains underexplored. Trained for denoising and generation without labels, they can be interpreted as self-supervised learners that capture both low- and high-level structure. We show that a frozen diffusion backbone enables strong fine-grained recognition by probing intermediate denoising features across layers and timesteps and training a linear classifier for each layer–timestep pair. We evaluate this approach in a real-world plankton-monitoring setting of practical impact, using controlled and comparable training setups against established supervised and self-supervised baselines. Frozen diffusion features are competitive with supervised baselines and outperform other self-supervised methods in both balanced and naturally long-tailed settings. Out-of-distribution evaluations on temporally and geographically shifted plankton datasets further show that frozen diffusion features maintain strong accuracy and macro F1 under substantial distribution shift.
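The probing protocol described above can be sketched as follows. This is a minimal, self-contained illustration, not the paper's implementation: `extract_features` is a synthetic stand-in for hooking intermediate activations of a frozen diffusion U-Net, the layer/timestep grid and all class data are toy values, and the linear probe is fit in closed form via ridge regression onto one-hot targets.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES, DIM = 5, 32
CENTERS = rng.standard_normal((NUM_CLASSES, DIM))  # synthetic class prototypes

def extract_features(labels, layer, timestep):
    """Stand-in for frozen diffusion features at a given (layer, timestep).

    In the real pipeline these would be activations hooked from a frozen
    denoising U-Net; here, class separability is simulated to peak at a
    mid-range layer and timestep.
    """
    signal = 2.0 / (1.0 + abs(layer - 2) + abs(timestep - 100) / 100.0)
    noise = rng.standard_normal((len(labels), DIM))
    return signal * CENTERS[labels] + noise

def fit_linear_probe(feats, labels, l2=1e-3):
    """Closed-form ridge regression onto one-hot targets (a linear probe)."""
    X = np.hstack([feats, np.ones((len(feats), 1))])  # append bias column
    Y = np.eye(NUM_CLASSES)[labels]
    return np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ Y)

def probe_accuracy(W, feats, labels):
    X = np.hstack([feats, np.ones((len(feats), 1))])
    return float((np.argmax(X @ W, axis=1) == labels).mean())

train_y = rng.integers(0, NUM_CLASSES, size=500)
test_y = rng.integers(0, NUM_CLASSES, size=200)

# Train one linear probe per (layer, timestep) pair on frozen features,
# then compare held-out accuracy across the grid.
results = {}
for layer in range(4):
    for t in (10, 100, 500):
        W = fit_linear_probe(extract_features(train_y, layer, t), train_y)
        results[(layer, t)] = probe_accuracy(
            W, extract_features(test_y, layer, t), test_y
        )

best = max(results, key=results.get)
print(f"best (layer, timestep) = {best}, accuracy = {results[best]:.3f}")
```

Sweeping the grid and reporting the best-performing pair mirrors the evaluation style in the abstract: the backbone stays frozen throughout, and only the lightweight linear heads are trained.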