Recent work leverages Vision Foundation Models as image encoders to boost the generative performance of latent diffusion models (LDMs), as their semantic feature distributions are easy to learn. However, such semantic features often lack low-level information (\eg, color and texture), leading to degraded reconstruction fidelity, which has emerged as a primary bottleneck in further scaling LDMs. To address this limitation, we propose LV-RAE, a representation autoencoder that augments semantic features with the missing low-level information, enabling high-fidelity reconstruction while remaining highly aligned with the semantic distribution. We further observe that the resulting high-dimensional, information-rich latents make decoders sensitive to latent perturbations, causing severe artifacts when decoding generated latents and consequently degrading generation quality. Our analysis suggests that this sensitivity primarily stems from excessive decoder responses along directions off the data manifold. Building on these insights, we propose fine-tuning the decoder to increase its robustness and smoothing the generated latents via controlled noise injection, thereby enhancing generation quality. Experiments demonstrate that LV-RAE significantly improves reconstruction fidelity while preserving semantic abstraction and achieving strong generative quality. Our code is available at https://github.com/modyu-liu/LVRAE.
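The controlled noise injection mentioned above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the function name `smooth_latents` and the noise scale `sigma` are assumptions for illustration; the idea is simply to perturb generated latents with small Gaussian noise so they fall in regions where the fine-tuned decoder responds robustly.

```python
import numpy as np

def smooth_latents(z, sigma=0.1, seed=0):
    """Perturb generated latents with small Gaussian noise.

    A hypothetical sketch of latent smoothing: adding controlled noise
    nudges off-manifold latents toward the neighborhood the robust
    decoder was trained on. `sigma` controls the perturbation strength.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(z.shape)
    return z + sigma * noise

# Example: smooth a batch of 4 latents of dimension 768.
z = np.zeros((4, 768))
z_smooth = smooth_latents(z, sigma=0.1)
```

With `sigma=0` the latents pass through unchanged, so the strength of the smoothing can be tuned (or annealed) without changing the decoding pipeline.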