Equipping humanoid robots with the capability to understand the emotional states of human interactants and to express emotions appropriately for the situation is essential for affective human-robot interaction. However, deploying current vision-aware multimodal emotion recognition models for affective human-robot interaction in the real world raises two embodiment challenges: handling environmental noise and meeting real-time requirements. First, in multiparty conversation scenarios, noise inherent in the robot's visual observations, arising from either 1) distracting objects in the scene or 2) inactive speakers appearing in the robot's field of view, hinders models from extracting emotional cues from visual inputs. Second, real-time response, a desired feature of any interactive system, is also challenging to achieve. To tackle both challenges, we introduce UGotMe, an affective human-robot interaction system designed specifically for multiparty conversations. To address the first issue, we propose and incorporate two denoising strategies into the system: to filter out distracting objects in the scene, we extract face images of the speakers from the raw frames, and we introduce a customized active face extraction strategy to rule out inactive speakers. To address the second issue, we employ efficient data transmission from the robot to a local server, improving real-time responsiveness. We deploy UGotMe on a humanoid robot named Ameca to validate its real-time inference capabilities in practical scenarios. Videos demonstrating the real-world deployment are available at https://lipzh5.github.io/HumanoidVLE/.
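The active face extraction step described above could be approached in several ways; the following is a minimal illustrative sketch, not the paper's actual method. It assumes an upstream face detector has already produced bounding boxes, and scores each face region by mean inter-frame pixel difference as a cheap proxy for speaking activity (lip and head motion), keeping only the highest-scoring face. The function name and heuristic are our own assumptions.

```python
import numpy as np

def pick_active_face(frame_prev, frame_curr, face_boxes):
    """Hypothetical heuristic for selecting the active speaker's face.

    frame_prev, frame_curr: consecutive uint8 frames (H, W[, C]).
    face_boxes: list of (x, y, w, h) boxes from any face detector.
    Scores each box by mean absolute inter-frame difference (a crude
    motion proxy) and returns the box with the largest score.
    """
    scores = []
    for (x, y, w, h) in face_boxes:
        prev_crop = frame_prev[y:y + h, x:x + w].astype(np.float32)
        curr_crop = frame_curr[y:y + h, x:x + w].astype(np.float32)
        scores.append(float(np.abs(curr_crop - prev_crop).mean()))
    return face_boxes[int(np.argmax(scores))], scores
```

In practice a system like this would likely aggregate motion scores over a short window and combine them with audio-based speaker activity cues, rather than relying on a single frame pair.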
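The efficient robot-to-server data transmission mentioned above can be sketched in a simple form: send only the cropped face regions, compressed, instead of full camera frames, which cuts bandwidth and helps latency. The sketch below uses lossless zlib compression over raw pixel bytes purely for illustration; the function names, header layout, and choice of codec are assumptions, not the paper's implementation (a real system would more plausibly use JPEG/H.264 encoding).

```python
import zlib
import numpy as np

def pack_face_crop(crop: np.ndarray) -> bytes:
    """Serialize a uint8 face crop (H, W, C) for transmission.

    A 12-byte header carries the shape as three int32 values,
    followed by the zlib-compressed pixel buffer.
    """
    header = np.array(crop.shape, dtype=np.int32).tobytes()
    return header + zlib.compress(crop.tobytes())

def unpack_face_crop(payload: bytes) -> np.ndarray:
    """Inverse of pack_face_crop: recover the original crop."""
    shape = np.frombuffer(payload[:12], dtype=np.int32)
    data = zlib.decompress(payload[12:])
    return np.frombuffer(data, dtype=np.uint8).reshape(shape)
```

Because only small face crops are serialized, the payload per frame stays far below the size of a full-resolution image, which is the property the system relies on for real-time response.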