Large Language Models (LLMs) have transformed many industries, improving productivity and facilitating learning across different fields. One intriguing application combines LLMs with visual models to create a novel approach to Human-Computer Interaction. The core idea of our system, uTalk, is to provide a user-friendly platform that lets people use ChatGPT's features in their everyday lives. uTalk comprises technologies such as Whisper, ChatGPT, Microsoft Speech Services, and the state-of-the-art (SOTA) talking head system SadTalker. Users can hold a human-like conversation with a digital twin and receive answers to their questions. uTalk can also generate content from a submitted image and an input (text or audio). The system is hosted on Streamlit, where users are prompted to provide an image that serves as their AI assistant. Once the input (text or audio) is provided, a sequence of operations produces a video of the avatar speaking the response. This paper outlines how SadTalker's run-time has been reduced by 27.69% for videos generated at 25 frames per second (FPS) and by 38.38% for our videos generated at 20 FPS. Furthermore, integrating and parallelizing SadTalker and Streamlit yields a 9.8% improvement over the initial performance of the system.
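The sequence of operations described above can be pictured as a four-stage chain. The following is a minimal sketch, not the authors' implementation: the class `AvatarPipeline`, its field names, and the exact data flow (speech-to-text, then the LLM, then text-to-speech, then the talking-head model) are assumptions inferred from the list of components, and each stage is passed in as a plain callable rather than a real call to Whisper, ChatGPT, Microsoft Speech Services, or SadTalker.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AvatarPipeline:
    """Hypothetical chain of the four stages named in the abstract.

    Every field is a stand-in callable; none of them is a real library API.
    """
    transcribe: Callable[[bytes], str]        # speech-to-text (the role Whisper plays)
    chat: Callable[[str], str]                # question answering (the role ChatGPT plays)
    synthesize: Callable[[str], bytes]        # text-to-speech (the role Microsoft Speech Services plays)
    animate: Callable[[bytes, bytes], bytes]  # image + speech -> video (the role SadTalker plays)

    def respond(self, image: bytes,
                text: Optional[str] = None,
                audio: Optional[bytes] = None) -> bytes:
        """Return video bytes for a text or audio input (assumed flow)."""
        if text is None:
            if audio is None:
                raise ValueError("provide either text or audio input")
            # Audio input is transcribed first; text input skips this stage.
            text = self.transcribe(audio)
        reply = self.chat(text)
        speech = self.synthesize(reply)
        return self.animate(image, speech)
```

Passing the stages in as callables keeps the sketch runnable without any external service and makes each stage easy to time or swap independently, which is the kind of separation that per-stage run-time measurements rely on.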