With the rapid development of artificial intelligence (AI), digital humans have attracted growing attention and are expected to find wide application across many industries. However, most existing digital humans still rely on manual modeling by designers, a cumbersome process with a long development cycle. There is therefore an urgent need for an AI-based digital human generation system that improves development efficiency. In this paper, we propose an implementation scheme for an intelligent digital human generation system with multimodal fusion. Specifically, the system takes text, speech, and image as inputs and synthesizes interactive speech using a large language model (LLM), voiceprint extraction, and text-to-speech conversion. Next, the input image is age-transformed, and a suitable image is selected as the driving image. The digital human video content is then modified and generated through digital human driving, novel view synthesis, and intelligent dressing techniques. Finally, we enhance the user experience through style transfer, super-resolution, and quality evaluation. Experimental results show that the system can effectively realize digital human generation. The related code is released at https://github.com/zyj-2000/CUMT_2D_PhotoSpeaker.
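The stage ordering described in the abstract can be illustrated with a minimal sketch. It is not the released implementation: the `Job` type, the `stage` helper, and every stage name below are hypothetical placeholders chosen for illustration, and each stage only records its name instead of calling a real model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Job:
    """The three inputs named in the abstract, plus a log of applied stages."""
    text: str    # user text prompt
    speech: str  # path to a reference voice recording (hypothetical)
    image: str   # path to a portrait image (hypothetical)
    trace: List[str] = field(default_factory=list)


def stage(name: str) -> Callable[[Job], Job]:
    """Build a placeholder stage that only records its own name."""
    def run(job: Job) -> Job:
        job.trace.append(name)
        return job
    return run


# Order follows the abstract; each entry stands in for a real model.
PIPELINE = [
    stage("llm_reply"),
    stage("voiceprint_extraction"),
    stage("text_to_speech"),
    stage("age_transformation"),
    stage("driving_image_selection"),
    stage("digital_human_driving"),
    stage("novel_view_synthesis"),
    stage("intelligent_dressing"),
    stage("style_transfer"),
    stage("super_resolution"),
    stage("quality_evaluation"),
]


def run_pipeline(job: Job) -> Job:
    """Apply every placeholder stage in order and return the updated job."""
    for step in PIPELINE:
        job = step(job)
    return job
```

Calling `run_pipeline(Job("hello", "voice.wav", "face.png"))` returns a job whose `trace` lists the stages in the order the abstract gives them. Keeping the stages in a flat list is only one possible design; it makes it easy to swap a placeholder for a real model without touching the rest of the sequence.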