The Talking Face Generation task has enormous potential for applications in digital humans, virtual agents, and beyond. Singing, a facial activity second in frequency only to talking, can be regarded as a universal language across ethnicities and cultures. However, it remains underexplored in this field due to the lack of singing face datasets and the domain gap between singing and talking in rhythm and amplitude. More importantly, the quality of Singing Face Generation (SFG) often falls short, varying widely across methods and application scenarios, which motivates timely and effective quality assessment methods to ensure user experience. To address these gaps, this paper introduces SFQA, a new quality assessment dataset for SFG content built with 12 representative generation methods. During dataset construction, 100 photographs or portraits and 36 music clips spanning 7 styles are used to generate the 5,184 singing face videos that constitute SFQA. To further explore the quality of SFG methods, a subjective quality assessment is conducted by human evaluators, whose ratings reveal significant variation in quality among generation methods. Based on the proposed SFQA dataset, we comprehensively benchmark current objective quality assessment algorithms.