Modeling face-to-face communication in computer vision, which focuses on recognizing and analyzing nonverbal cues and behaviors during interactions, serves as the foundation for our proposed alternative to text-based Human-AI interaction. By leveraging nonverbal visual communication, including facial expressions and head and body movements, we aim to enhance engagement and capture the user's attention through a novel improvisational element that goes beyond mirroring gestures. Our goal is to track and analyze facial expressions and other nonverbal cues in real time, and to use this information to build models that can predict and understand human behavior. Because it operates in real time and requires minimal computational resources, our approach marks a significant step toward making AI interactions more natural and accessible. We offer three complementary approaches based on retrieval, statistical, and deep learning techniques. A key novelty of our work is the integration of an artistic component on top of an efficient human-computer interaction system, using art as a medium for conveying emotion. Our approach is not art-specific and can be adapted to various paintings, animations, and avatars. In our experiments, we compare state-of-the-art diffusion models as mediums for emotion translation in 2D with Maia, the 3D avatar we introduce in this work, which conveys not only facial movements but also body motions for a more natural and engaging experience. Through both human and automatic evaluation procedures, we demonstrate the effectiveness of our approach in translating AI-generated emotions into human-relatable expressions, highlighting its potential to significantly enhance the naturalness and engagement of Human-AI interactions across a range of applications.