Large Language Models (LLMs) have transformed many industries by improving productivity and facilitating learning across different fields. One intriguing application combines LLMs with visual models to create a novel approach to Human-Computer Interaction. The core idea behind this system is an interactive platform that lets the general public use the capabilities of ChatGPT in their daily lives. This is achieved by integrating several technologies, including Whisper, ChatGPT, Microsoft Speech Services, and the state-of-the-art (SOTA) talking head system SadTalker, resulting in uTalk, an intelligent AI system. Users can converse with the animated portrait and receive answers to whatever questions they have in mind. They can also use uTalk for content generation by providing an input and their image. The system is hosted on Streamlit, where the user is first asked to provide an image to serve as their AI assistant. The user then chooses whether to have a conversation or to generate content. Either way, the user provides an input, the system performs a set of operations on it, and the avatar delivers a precise response. The paper discusses how SadTalker is optimized to reduce its running time by 27.72%, measured on videos generated at 25 FPS. In addition, the performance of the overall uTalk system improved by a further 9.8% after SadTalker was integrated and parallelized with Streamlit.
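The conversation flow described above can be pictured as a chain of four stages. The following is a minimal sketch only, not the authors' implementation: the class name `TalkPipeline`, its field names, and the stage order (speech to text, text to reply, reply to speech, speech plus portrait to video) are assumptions inferred from the roles of the listed components. Each stage is a plain callable, so no real Whisper, ChatGPT, Microsoft Speech Services, or SadTalker API is invoked or implied.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TalkPipeline:
    """Hypothetical wiring of the four stages; every name here is illustrative."""

    transcribe: Callable[[bytes], str]        # question audio -> text (Whisper's role)
    answer: Callable[[str], str]              # question text -> reply text (ChatGPT's role)
    synthesize: Callable[[str], bytes]        # reply text -> speech audio (speech service's role)
    animate: Callable[[bytes, bytes], bytes]  # portrait + speech -> video (SadTalker's role)

    def converse(self, portrait: bytes, question_audio: bytes) -> bytes:
        """Run one conversational turn and return the talking-head video bytes."""
        question = self.transcribe(question_audio)
        reply = self.answer(question)
        speech = self.synthesize(reply)
        return self.animate(portrait, speech)


# Stand-in stages so the sketch runs without any external service.
pipeline = TalkPipeline(
    transcribe=lambda audio: audio.decode(),
    answer=lambda question: f"echo: {question}",
    synthesize=lambda text: text.encode(),
    animate=lambda portrait, speech: portrait + b"|" + speech,
)

video = pipeline.converse(b"portrait", b"hello")
```

Keeping each stage behind a callable makes it possible to swap in a real component one at a time, and to time each stage separately when looking for the kind of running-time gains the abstract reports.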