Social robots aim to establish long-term bonds with humans through engaging conversation. However, traditional conversational approaches, which rely on scripted interactions, often fall short in maintaining engaging conversations. This paper addresses this limitation by integrating large language models (LLMs) into social robots to achieve more dynamic and expressive conversations. We introduce a fully automated conversation system that leverages LLMs to generate robot responses with expressive behaviors, congruent with the robot's personality. The robot's behavior is expressed through two modalities: 1) a text-to-speech (TTS) engine capable of various delivery styles, and 2) a library of physical actions for the robot. We develop a custom, state-of-the-art emotion recognition model to dynamically select the robot's tone of voice, and we use emojis in the LLM output as cues for generating robot actions. A demo of our system is available here. To illuminate design and implementation issues, we conduct a pilot study in which volunteers chat with a social robot using our proposed system, and we analyze their feedback and conduct a rigorous error analysis of the chat transcripts. Feedback was overwhelmingly positive, with participants commenting on the robot's empathy, helpfulness, naturalness, and entertainment value. Most negative feedback was due to automatic speech recognition (ASR) errors, which had limited impact on conversations. However, we observed a small class of errors, such as the LLM repeating itself or hallucinating fictitious information and human responses, that have the potential to derail conversations, raising important issues for the application of LLMs.
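To make the emoji-as-cue idea concrete, the following is a minimal sketch of how an LLM reply might be split into speakable text and action cues. It is an illustration under stated assumptions, not the system's implementation: the `EMOJI_ACTIONS` table, the action names, and the `split_reply` helper are all hypothetical, and the selection of the TTS delivery style by the emotion recognition model is not modeled here.

```python
# Hypothetical emoji-to-action table; the action names are placeholders,
# not the robot's real action library.
EMOJI_ACTIONS = {
    "😀": "smile",
    "😢": "look_down",
    "👋": "wave",
    "🎉": "celebrate",
}


def split_reply(reply: str) -> tuple[str, list[str]]:
    """Separate the speakable text from the action cues in an LLM reply."""
    # Each known emoji becomes one action cue, in order of appearance.
    actions = [EMOJI_ACTIONS[ch] for ch in reply if ch in EMOJI_ACTIONS]
    # Known emojis are dropped from the text that would be sent to TTS.
    spoken = "".join(ch for ch in reply if ch not in EMOJI_ACTIONS)
    # Collapse whitespace left behind where emojis were removed.
    return " ".join(spoken.split()), actions


text, actions = split_reply("Hi there 👋 I'm so glad you came back 🎉")
print(text)
print(actions)
```

In this sketch only emojis present in the table are treated as cues; any other emoji stays in the spoken text, so a fuller version would need a policy for unmapped emojis.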