Virtual humans have gained considerable attention in numerous industries, e.g., entertainment and e-commerce. As a core technology, synthesizing photorealistic face frames from target speech and facial identity has been actively studied with generative adversarial networks. Despite remarkable results of modern talking-face generation models, they often entail high computational burdens, which limit their efficient deployment. This study aims to develop a lightweight model for speech-driven talking-face synthesis. We build a compact generator by removing the residual blocks and reducing the channel width from Wav2Lip, a popular talking-face generator. We also present a knowledge distillation scheme to stably yet effectively train the small-capacity generator without adversarial learning. We reduce the number of parameters and MACs by 28$\times$ while retaining the performance of the original model. Moreover, to alleviate a severe performance drop when converting the whole generator to INT8 precision, we adopt a selective quantization method that uses FP16 for the quantization-sensitive layers and INT8 for the other layers. Using this mixed precision, we achieve up to a 19$\times$ speedup on edge GPUs without noticeably compromising the generation quality.
翻译:虚拟人类在娱乐、电子商务等多个行业中备受关注。作为核心技术,利用生成对抗网络从目标语音和人脸身份合成逼真人脸帧的研究已十分活跃。尽管现代说话人脸生成模型取得了显著成果,但其通常伴随高昂的计算开销,限制了高效部署。本研究旨在开发面向语音驱动说话人脸合成的轻量级模型。通过移除Wav2Lip(一种流行的说话人脸生成器)中的残差模块并缩减通道宽度,我们构建了一个紧凑型生成器。我们还提出了一种知识蒸馏方案,可在无需对抗学习的情况下稳定且有效地训练小型生成器。我们将参数量和乘加运算量减少28倍,同时保持原模型性能。此外,为缓解将整个生成器转换为INT8精度时产生的严重性能下降,我们采用了一种选择性量化方法:对量化敏感层使用FP16,其余层使用INT8。借助这种混合精度策略,我们可在边缘GPU上实现高达19倍的加速,且生成质量无明显下降。