Recent progress in multi-modal conditioned face synthesis has enabled the creation of visually striking and accurately aligned facial images. Yet current methods still suffer from limited scalability, limited flexibility, and a one-size-fits-all control strength that ignores the differing levels of conditional entropy, a measure of the unpredictability of data given some condition, across modalities. To address these challenges, we introduce a novel uni-modal training approach with modal surrogates, coupled with entropy-aware modal-adaptive modulation, to build a flexible, scalable, and adaptive multi-modal conditioned face synthesis network. Our uni-modal training leverages only uni-modal data: modal surrogates decorate each condition with modal-specific characteristics and serve as linkers for inter-modal collaboration, so the network fully learns both per-modality control over the face synthesis process and inter-modal collaboration. The entropy-aware modal-adaptive modulation finely adjusts the diffusion noise according to modal-specific characteristics and the given conditions, enabling well-informed steps along the denoising trajectory and ultimately yielding synthesis results of high fidelity and quality. Thorough experiments demonstrate that our framework improves multi-modal face synthesis under various conditions, surpassing current methods in image quality and fidelity.
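The abstract does not specify the exact form of the entropy-aware modulation. As a minimal illustrative sketch only, one could imagine mapping per-modality conditional entropies to control-strength weights, so that low-entropy modalities (e.g., a segmentation mask, which strongly constrains the image) exert more control than high-entropy ones (e.g., a text prompt). The function names, the softmax weighting, and the classifier-free-guidance-style combination below are all assumptions for illustration, not the paper's formulation.

```python
import math

def modal_adaptive_weights(entropies, temperature=1.0):
    """Map per-modality conditional entropies to control-strength weights.

    Lower entropy -> larger weight: a condition that pins down the image
    more tightly is given stronger influence. (Illustrative assumption.)
    """
    logits = [-h / temperature for h in entropies]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def modulate_noise(eps_uncond, eps_cond_per_modal, entropies, scale=2.0):
    """Combine per-modality noise predictions with entropy-aware weights,
    in the style of classifier-free guidance (hypothetical sketch)."""
    weights = modal_adaptive_weights(entropies)
    guided = eps_uncond
    for w, eps_c in zip(weights, eps_cond_per_modal):
        guided += scale * w * (eps_c - eps_uncond)
    return guided
```

For example, with entropies `[1.0, 3.0]` the first modality receives the larger weight, so its conditional noise prediction dominates the guided denoising step.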