Equipping humanoid robots with the capability to understand the emotional states of human interactants and to express emotions appropriately for the situation is essential for affective human-robot interaction. However, deploying current vision-aware multimodal emotion recognition models for affective human-robot interaction in the real world raises embodiment challenges: handling environmental noise and meeting real-time requirements. First, in multiparty conversation scenarios, noise inherent in the robot's visual observations, arising from either 1) distracting objects in the scene or 2) inactive speakers appearing in the robot's field of view, hinders the models from extracting emotional cues from visual input. Second, real-time response, a desired feature of any interactive system, is also challenging to achieve. To tackle both challenges, we introduce UGotMe, an affective human-robot interaction system designed specifically for multiparty conversations. Two denoising strategies are proposed and incorporated into the system to address the first issue: to filter out distracting objects in the scene, we extract face images of the speakers from the raw images, and we introduce a customized active face extraction strategy to rule out inactive speakers. For the second issue, we employ efficient data transmission from the robot to a local server to improve real-time responsiveness. We deploy UGotMe on a humanoid robot named Ameca to validate its real-time inference capabilities in practical scenarios. Videos demonstrating the real-world deployment are available at https://pi3-141592653.github.io/UGotMe/.
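The abstract does not spell out how the active face extraction strategy decides which detected face belongs to the current speaker. A minimal illustrative sketch, assuming (hypothetically) that speaking activity is proxied by frame-to-frame motion in the mouth region of each tracked face crop:

```python
import numpy as np

def select_active_face(face_tracks):
    """Pick the likely active speaker from tracked face crops.

    face_tracks: dict mapping a speaker id to a (T, H, W) grayscale
    array of face crops over T consecutive frames. The face whose
    lower half (a rough mouth region) changes most between frames is
    returned, as a simple proxy for "currently speaking". This is an
    assumed heuristic, not the system's actual strategy.
    """
    scores = {}
    for sid, frames in face_tracks.items():
        mouth = frames[:, frames.shape[1] // 2:, :]  # lower half of crop
        diffs = np.abs(np.diff(mouth.astype(float), axis=0))
        scores[sid] = diffs.mean()
    return max(scores, key=scores.get)

# toy demo: speaker "b" has a changing mouth region, "a" is static
rng = np.random.default_rng(0)
static = np.tile(rng.integers(0, 255, (1, 32, 32)), (5, 1, 1))
moving = rng.integers(0, 255, (5, 32, 32))
print(select_active_face({"a": static, "b": moving}))  # -> b
```

In a full pipeline, the crops would come from a face detector run on the robot's camera stream; only the selected crop would then be forwarded to the emotion recognition model, filtering out both background objects and inactive speakers.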