Large Language Models (LLMs) have significantly advanced user-bot interactions, enabling more complex and coherent dialogues. However, the prevalent text-only modality may not fully exploit the potential for effective user engagement. This paper explores the impact of multi-modal interactions, which incorporate images and audio alongside text, on user engagement in chatbot conversations. We conduct a comprehensive analysis using a diverse set of chatbots and real-user interaction data, employing metrics such as retention rate and conversation length to evaluate user engagement. Our findings reveal a significant enhancement in user engagement with multi-modal interactions compared to text-only dialogues. Notably, incorporating a third modality amplifies engagement beyond the benefits observed with two modalities alone. These results suggest that multi-modal interactions optimize cognitive processing and facilitate richer information comprehension. This study underscores the importance of multi-modality in chatbot design, offering insights for creating more engaging and immersive AI communication experiences and informing the broader AI community about the benefits of multi-modal interactions in enhancing user engagement.