Video retargeting for digital face animation is used in virtual reality, social media, gaming, movies, and video conferencing; its goal is to animate an avatar's facial expressions from videos of human faces. The standard way to represent facial expressions for 3D characters is with blendshapes: a vector of weights that combines the avatar's neutral shape with its variations under facial expressions such as smiling, puffing, and blinking. Datasets of video frames paired with blendshape vectors are rare, and labeling them is laborious, time-consuming, and subjective. In this work, we developed an approach that copes with the lack of appropriate datasets by training on a synthetic dataset of only one character. To generalize to various characters, we re-represented each frame as facial landmarks. We developed a deep-learning architecture that groups the landmarks of each facial organ and connects each group to the relevant blendshape weights. Additionally, we incorporated complementary methods for facial expressions that landmarks did not represent well, and gave special attention to eye expressions. We demonstrated that our approach outperforms previous research in both qualitative and quantitative metrics: when tested on videos with various users and expressions, it achieved a 68% higher mean opinion score (MOS) and a 44.2% lower mean squared error (MSE).
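To make the blendshape representation concrete, the following is a minimal sketch that assumes the common linear (delta) blendshape formulation, in which the animated mesh is the neutral shape plus a weighted sum of per-expression offsets. The abstract does not specify the rig's exact formulation, so the function names, array shapes, and toy values here are illustrative assumptions rather than the authors' implementation. The second helper shows how an MSE between predicted and reference weight vectors might be computed.

```python
import numpy as np


def apply_blendshapes(neutral, deltas, weights):
    """Evaluate a linear (delta) blendshape rig.

    neutral: (V, 3) vertex positions of the neutral face.
    deltas:  (K, V, 3) per-blendshape vertex offsets from the neutral face.
    weights: (K,) blendshape weights, clipped here to [0, 1] (an assumption).
    Returns the (V, 3) animated vertex positions.
    """
    weights = np.clip(np.asarray(weights, dtype=float), 0.0, 1.0)
    # Weighted sum over the K blendshapes, added to the neutral shape.
    return neutral + np.tensordot(weights, deltas, axes=1)


def weight_mse(pred, target):
    """Mean squared error between two blendshape weight vectors."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))


# Toy rig: 4 vertices, 2 hypothetical blendshapes.
neutral = np.zeros((4, 3))
deltas = np.zeros((2, 4, 3))
deltas[0, 0, 1] = 1.0  # hypothetical "smile": moves vertex 0 along y
deltas[1, 1, 0] = 2.0  # hypothetical "puff": moves vertex 1 along x

mesh = apply_blendshapes(neutral, deltas, [0.5, 0.25])
error = weight_mse([0.0, 1.0], [0.0, 0.0])
```

In this toy setup a retargeting model would only need to predict the two-element weight vector per frame; the rig then produces the full mesh, which is why the weights are a compact target for learning.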