Unified multimodal models (UMMs) have emerged as a powerful paradigm in foundational cross-modality research, demonstrating significant potential in both image understanding and generation. However, existing research in the face domain faces two main challenges: $\textbf{(1) fragmented development}$, as existing methods fail to unify understanding and generation within a single model, hindering progress toward artificial general intelligence; and $\textbf{(2) a lack of fine-grained facial attributes}$, which are crucial for high-fidelity applications. To address these issues, we propose $\textbf{UniF$^2$ace}$, $\textit{the first UMM specifically tailored for fine-grained face understanding and generation}$. $\textbf{First}$, we introduce a novel theoretical framework with a Dual Discrete Diffusion (D3Diff) loss that unifies masked generative models with discrete score-matching diffusion, yielding a more precise approximation of the negative log-likelihood. This loss also significantly enhances the model's ability to synthesize high-fidelity facial details aligned with the text input. $\textbf{Second}$, we propose a multi-level grouped Mixture-of-Experts architecture that adaptively incorporates semantic and identity facial embeddings to mitigate the attribute-forgetting phenomenon that arises as representations evolve. $\textbf{Finally}$, we construct UniF$^2$aceD-1M, a large-scale dataset comprising 130K fine-grained image-caption pairs and 1M visual question-answering pairs, spanning a much wider range of facial attributes than existing datasets. Extensive experiments demonstrate that UniF$^2$ace outperforms existing models of a similar scale on both understanding and generation tasks, achieving a 7.1\% higher Desc-GPT score and a 6.6\% higher VQA-score, respectively.
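To convey the intuition behind the D3Diff objective, the sketch below combines a masked-generation cross-entropy term (the standard absorbing-state discrete-diffusion bound on the negative log-likelihood) with a discrete score-matching term. This is a minimal illustrative rendering, not the paper's actual formulation: the function name `d3diff_loss`, the inputs `score_pred` and `score_ref`, and the balancing weight `lam` are all assumptions.

```python
import torch
import torch.nn.functional as F

def d3diff_loss(logits, x0, mask, score_pred, score_ref, lam=1.0):
    """Hypothetical sketch of a dual discrete diffusion objective.

    logits:     (B, T, V) model predictions over the token vocabulary
    x0:         (B, T)    clean target tokens
    mask:       (B, T)    bool, True where tokens were masked (absorbed)
    score_pred: (B, T, V) model's discrete score estimates
    score_ref:  (B, T, V) reference scores from the forward corruption process
    lam:        assumed weight balancing the two terms
    """
    # Masked generative term: cross-entropy restricted to absorbed positions,
    # the usual masked discrete-diffusion surrogate for the NLL.
    ce = F.cross_entropy(logits[mask], x0[mask])

    # Discrete score-matching term: pull the model's score estimates toward
    # the reference scores induced by the forward process.
    sm = F.mse_loss(score_pred[mask], score_ref[mask])

    return ce + lam * sm
```

Under this reading, the score-matching term regularizes the same network that produces the masked-token predictions, which is one plausible way the two discrete-diffusion views could be tied to a single tighter likelihood bound.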
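As a rough illustration of the multi-level grouped Mixture-of-Experts idea, the sketch below routes token features to expert groups, where each group is conditioned on one auxiliary facial embedding (e.g., semantic features from a vision encoder or identity features from a face-recognition model). The class `GroupedFaceMoE`, its parameters, and the additive conditioning scheme are hypothetical design choices for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GroupedFaceMoE(nn.Module):
    """Hypothetical grouped MoE layer: each expert group is conditioned on
    one auxiliary facial embedding (e.g., semantic or identity features)."""

    def __init__(self, dim, cond_dim, num_groups=2, experts_per_group=4):
        super().__init__()
        self.num_groups = num_groups
        self.experts_per_group = experts_per_group
        # Soft router over all experts across all groups.
        self.router = nn.Linear(dim, num_groups * experts_per_group)
        # One projection per group to inject its facial embedding.
        self.cond_proj = nn.ModuleList(
            nn.Linear(cond_dim, dim) for _ in range(num_groups))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_groups * experts_per_group))

    def forward(self, x, conds):
        # x: (B, T, D) hidden states; conds: num_groups tensors of (B, cond_dim)
        weights = torch.softmax(self.router(x), dim=-1)  # (B, T, G*E)
        out = torch.zeros_like(x)
        for g in range(self.num_groups):
            # Inject the group's facial embedding into the token stream.
            xg = x + self.cond_proj[g](conds[g]).unsqueeze(1)
            for e in range(self.experts_per_group):
                idx = g * self.experts_per_group + e
                out = out + weights[..., idx:idx + 1] * self.experts[idx](xg)
        return out

# Example usage with assumed shapes: two groups, one semantic and one
# identity embedding per image.
moe = GroupedFaceMoE(dim=512, cond_dim=512)
x = torch.randn(2, 16, 512)
semantic, identity = torch.randn(2, 512), torch.randn(2, 512)
y = moe(x, [semantic, identity])  # (2, 16, 512)
```

Conditioning each group on a different facial embedding is one way an adaptive router could keep fine-grained attributes available at deeper layers, consistent with the attribute-forgetting motivation above.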