In latent diffusion models, the autoencoder (AE) is typically expected to balance two capabilities: faithful reconstruction and a generation-friendly latent space (e.g., low gFID). In recent ImageNet-scale AE studies, we observe a systematic bias toward generative metrics in how this trade-off is handled: reconstruction metrics are increasingly under-reported, and ablation-based AE selection often favors the best-gFID configuration even when reconstruction fidelity degrades. We theoretically analyze why this gFID-dominant preference can appear unproblematic for ImageNet generation yet becomes risky when scaling to controllable diffusion: AEs can induce condition drift, which limits the achievable condition alignment. We also find that reconstruction fidelity, especially instance-level measures, is a better indicator of controllability. We empirically validate the impact of this skewed AE evaluation on controllability by studying several recent ImageNet AEs. Using a multi-dimensional condition-drift evaluation protocol that reflects controllable generation tasks, we find that gFID is only weakly predictive of condition preservation, whereas reconstruction-oriented metrics are substantially more aligned with it. ControlNet experiments further confirm that controllability tracks condition preservation rather than gFID. Overall, our results expose a gap between ImageNet-centric AE evaluation and the requirements of scalable controllable diffusion, and they offer practical guidance for more reliable benchmarking and model selection.