AU Codes, Language, and Synthesis: Translating Anatomy to Text for Facial Behavior Synthesis

Facial behavior synthesis remains a critical yet underexplored challenge. While text-to-face models have made progress, they often rely on coarse emotion categories, which lack the nuance needed to capture the full spectrum of human nonverbal communication. Action Units (AUs) provide a more precise and anatomically grounded alternative. However, current AU-based approaches typically encode AUs as one-hot vectors, modeling compound expressions as simple linear combinations of individual AUs. This linearity becomes problematic when handling conflicting AUs, defined as those that activate the same facial muscle with opposing actions. Such cases lead to anatomically implausible artifacts and unnatural motion superpositions. To address this, we propose a novel method that represents facial behavior through natural language descriptions of AUs. This approach preserves the expressiveness of the AU framework while enabling explicit modeling of complex and conflicting AUs. It also unlocks the potential of modern text-to-image models for high-fidelity facial synthesis. Supporting this direction, we introduce BP4D-AUText, the first large-scale text-image paired dataset for complex facial behavior. It is synthesized by applying a rule-based Dynamic AU Text Processor to the BP4D and BP4D+ datasets. We further propose VQ-AUFace, a generative model that leverages facial structural priors to synthesize realistic and diverse facial behaviors from text. Extensive quantitative experiments and user studies demonstrate that our approach significantly outperforms existing methods. It excels in generating facial expressions that are anatomically plausible, behaviorally rich, and perceptually convincing, particularly under challenging conditions involving conflicting AUs.

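
To make the contrast in the abstract concrete, the following minimal sketch encodes the same set of active AUs in two ways: as a multi-hot vector (the sum of one-hot vectors, which cannot express how two AUs interact) and as a natural-language description produced by simple rules. This is not the paper's Dynamic AU Text Processor, whose rules are not given in the abstract. The AU names follow standard FACS terminology, but the function names, the sentence templates, and the choice of which AU pairs count as conflicting are all assumptions made for illustration.

```python
# Hypothetical sketch; NOT the paper's Dynamic AU Text Processor.
# AU names follow standard FACS terminology.
AU_NAMES = {
    1: "inner brow raiser",
    4: "brow lowerer",
    6: "cheek raiser",
    12: "lip corner puller",
    15: "lip corner depressor",
}

# Fixed vector position for each AU, in ascending AU order.
AU_INDEX = {au: i for i, au in enumerate(sorted(AU_NAMES))}

# Assumption for illustration only: pairs treated as "conflicting"
# because they move the same facial region in opposing directions.
CONFLICTING_PAIRS = [(1, 4), (12, 15)]


def multi_hot(active_aus):
    """Sum of one-hot vectors: records which AUs are on, nothing more."""
    vec = [0] * len(AU_INDEX)
    for au in active_aus:
        vec[AU_INDEX[au]] = 1
    return vec


def describe(active_aus):
    """Rule-based text description that names conflicts explicitly."""
    active = sorted(set(active_aus))
    if not active:
        return "a neutral face"
    parts = [f"{AU_NAMES[au]} (AU{au})" for au in active]
    text = "a face showing " + ", ".join(parts)
    for a, b in CONFLICTING_PAIRS:
        if a in active and b in active:
            text += f"; AU{a} and AU{b} act in opposition on the same region"
    return text


if __name__ == "__main__":
    print(multi_hot([12, 15]))
    print(describe([12, 15]))
```

In the vector form, AU12 and AU15 are simply two active entries, so a model that combines them linearly has no signal that they oppose each other. In the text form, the opposition can be stated directly and left to a text encoder to interpret.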