In the rapidly evolving landscape of human-computer interaction, integrating vision capabilities into conversational agents is a crucial advancement. This paper presents an initial implementation of a dialogue manager that leverages recent progress in Large Language Models (e.g., GPT-4, IDEFICS) to augment traditional text-based prompts with real-time visual input. The LLMs interpret both textual prompts and visual stimuli, yielding a more contextually aware conversational agent. The system's prompt engineering combines the dialogue with summarisation of the images, balancing context preservation against computational efficiency. Six interactions with a Furhat robot powered by this system are reported, and the results obtained are illustrated and discussed. With this vision-enabled dialogue system, the paper looks toward a future in which conversational agents seamlessly blend textual and visual modalities, enabling richer, more context-aware dialogues.