Diffusion models trained on large amounts of data have shown remarkable performance in image synthesis. When used for classification, they exhibit high error consistency with humans and low texture bias. Furthermore, prior work has demonstrated that their bottleneck-layer representations can be decomposed into semantic directions. In this work, we analyze how well such representations align with human responses on a triplet odd-one-out task. We find that, despite the aforementioned observations: I) the representational alignment with humans is comparable to that of models trained only on ImageNet-1k; II) the most aligned layers of the denoiser U-Net are intermediate layers, not the bottleneck; III) text conditioning greatly improves alignment at high noise levels, hinting at the importance of abstract textual information, especially in the early stages of generation.
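To make the triplet odd-one-out evaluation concrete, the following is a minimal sketch of how such an alignment score could be computed from a matrix of image representations. It is an illustration under stated assumptions, not the procedure used in this work: the use of cosine similarity, the function names `odd_one_out` and `odd_one_out_accuracy`, and the encoding of human choices as positions 0, 1, or 2 within each triplet are all hypothetical.

```python
import numpy as np


def odd_one_out(embeddings: np.ndarray, triplet: tuple) -> int:
    """Return the position (0, 1, or 2) within the triplet of the predicted odd one out.

    Assumption: similarity is cosine similarity between representation vectors.
    """
    x = embeddings[list(triplet)]
    # Normalize rows so that dot products are cosine similarities.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = x @ x.T
    # Similarities of the three pairs (0,1), (0,2), (1,2).
    pair_sims = np.array([sim[0, 1], sim[0, 2], sim[1, 2]])
    # The odd one out is the item NOT in the most similar pair.
    odd_for_pair = (2, 1, 0)
    return odd_for_pair[int(np.argmax(pair_sims))]


def odd_one_out_accuracy(embeddings, triplets, human_choices) -> float:
    """Fraction of triplets where the model's odd one out matches the human choice."""
    preds = np.array([odd_one_out(embeddings, t) for t in triplets])
    return float(np.mean(preds == np.asarray(human_choices)))


# Toy example: items 0 and 1 are similar, item 2 is different.
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
toy_triplets = [(0, 1, 2), (0, 2, 1)]
toy_human_choices = [2, 1]  # position of the odd item within each triplet
toy_accuracy = odd_one_out_accuracy(toy_embeddings, toy_triplets, toy_human_choices)
```

In a layer-wise analysis, `embeddings` would be replaced by features extracted from a given U-Net layer at a given noise level, and the resulting accuracies compared across layers and conditions.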