Unified multimodal models (UMMs) have emerged as a powerful paradigm in foundational cross-modality research, demonstrating significant potential in both image understanding and generation. However, existing research in the face domain faces two main challenges: $\textbf{(1) fragmented development}$, as existing methods fail to unify understanding and generation within a single model, hindering progress toward artificial general intelligence; and $\textbf{(2) a lack of fine-grained facial attributes}$, which are crucial for high-fidelity applications. To address these issues, we propose $\textbf{UniF$^2$ace}$, $\textit{the first UMM specifically tailored for fine-grained face understanding and generation}$. $\textbf{First}$, we introduce a novel theoretical framework with a Dual Discrete Diffusion (D3Diff) loss that unifies masked generative models with discrete score-matching diffusion, yielding a more precise approximation of the negative log-likelihood. This loss also significantly enhances the model's ability to synthesize high-fidelity facial details aligned with the text input. $\textbf{Second}$, we propose a multi-level grouped Mixture-of-Experts architecture that adaptively incorporates semantic and identity facial embeddings to mitigate the attribute-forgetting phenomenon that arises as representations evolve. $\textbf{Finally}$, we construct UniF$^2$aceD-1M, a large-scale dataset comprising 130K fine-grained image-caption pairs and 1M visual question-answering pairs, spanning a much wider range of facial attributes than existing datasets. Extensive experiments demonstrate that UniF$^2$ace outperforms existing models of a similar scale on both understanding and generation tasks, achieving a 7.1\% higher Desc-GPT score and a 6.6\% higher VQA-score, respectively.
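To convey the intuition behind the D3Diff objective, the sketch below combines a masked-generation cross-entropy term (the standard absorbing-state discrete-diffusion bound on the negative log-likelihood) with a discrete score-matching term. This is a minimal illustrative rendering, not the paper's actual formulation: the function name `d3diff_loss`, the inputs `score_pred` and `score_ref`, and the balancing weight `lam` are all assumptions.

```python
import torch
import torch.nn.functional as F

def d3diff_loss(logits, x0, mask, score_pred, score_ref, lam=1.0):
    """Hypothetical sketch of a dual discrete diffusion objective.

    logits:     (B, T, V) model predictions over the token vocabulary
    x0:         (B, T)    clean target tokens
    mask:       (B, T)    bool, True where tokens were masked (absorbed)
    score_pred: (B, T, V) model's discrete score estimates
    score_ref:  (B, T, V) reference scores from the forward corruption process
    lam:        assumed weight balancing the two terms
    """
    # Masked generative term: cross-entropy restricted to absorbed positions,
    # the usual masked discrete-diffusion surrogate for the NLL.
    ce = F.cross_entropy(logits[mask], x0[mask])

    # Discrete score-matching term: pull the model's score estimates toward
    # the reference scores induced by the forward process.
    sm = F.mse_loss(score_pred[mask], score_ref[mask])

    return ce + lam * sm
```

Under this reading, the score-matching term regularizes the same network that produces the masked-token predictions, which is one plausible way the two discrete-diffusion views could be tied to a single tighter likelihood bound.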
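As a rough illustration of the multi-level grouped Mixture-of-Experts idea, the sketch below routes token features to expert groups, where each group is conditioned on one auxiliary facial embedding (e.g., semantic features from a vision encoder or identity features from a face-recognition model). The class `GroupedFaceMoE`, its parameters, and the additive conditioning scheme are hypothetical design choices for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GroupedFaceMoE(nn.Module):
    """Hypothetical grouped MoE layer: each expert group is conditioned on
    one auxiliary facial embedding (e.g., semantic or identity features)."""

    def __init__(self, dim, cond_dim, num_groups=2, experts_per_group=4):
        super().__init__()
        self.num_groups = num_groups
        self.experts_per_group = experts_per_group
        # Soft router over all experts across all groups.
        self.router = nn.Linear(dim, num_groups * experts_per_group)
        # One projection per group to inject its facial embedding.
        self.cond_proj = nn.ModuleList(
            nn.Linear(cond_dim, dim) for _ in range(num_groups))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_groups * experts_per_group))

    def forward(self, x, conds):
        # x: (B, T, D) hidden states; conds: num_groups tensors of (B, cond_dim)
        weights = torch.softmax(self.router(x), dim=-1)  # (B, T, G*E)
        out = torch.zeros_like(x)
        for g in range(self.num_groups):
            # Inject the group's facial embedding into the token stream.
            xg = x + self.cond_proj[g](conds[g]).unsqueeze(1)
            for e in range(self.experts_per_group):
                idx = g * self.experts_per_group + e
                out = out + weights[..., idx:idx + 1] * self.experts[idx](xg)
        return out

# Example usage with assumed shapes: two groups, one semantic and one
# identity embedding per image.
moe = GroupedFaceMoE(dim=512, cond_dim=512)
x = torch.randn(2, 16, 512)
semantic, identity = torch.randn(2, 512), torch.randn(2, 512)
y = moe(x, [semantic, identity])  # (2, 16, 512)
```

Conditioning each group on a different facial embedding is one way an adaptive router could keep fine-grained attributes available at deeper layers, consistent with the attribute-forgetting motivation above.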