Reacting like Humans: Incorporating Intrinsic Human Behaviors into NAO through Sound-Based Reactions for Enhanced Sociability

Robots become more acceptable to humans, and more sociable, when they react in human-like ways. Humans react to environmental events very quickly and without deliberation. One example is the natural reaction to a sudden, loud sound that startles or frightens them: people may instinctively move their hands, turn toward the origin of the sound, and try to determine what caused it. This inherent behavior motivated us to explore this less-studied part of social robotics. In this work, we designed a multi-modal system composed of an action generator, a sound classifier, and a YOLO object detector. The system senses the environment and, when a sudden loud sound occurs, displays natural human fear reactions and then locates the fear-causing sound source in the environment. These unique and valid generated motions and inferences could imitate intrinsic human reactions and enhance the sociability of robots. For motion generation, we propose a model based on long short-term memory (LSTM) and mixture density network (MDN) components that synthesizes varied motions. For sound detection, we chose a transfer learning model that takes the spectrogram of the sound signal as its input. After the individual models for sound detection, motion generation, and image recognition were developed, they were integrated into a comprehensive fear module that was implemented on the NAO robot. Finally, the fear module was tested in a practical application, and two groups, experts and non-experts, filled out a questionnaire to evaluate the robot's performance. Given our promising results, this preliminary exploratory research provides a fresh perspective on social robotics and could be a starting point for modeling intrinsic human behaviors and emotions in robots.
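
As an illustration of the spectrogram input mentioned above, the following is a minimal sketch of how a sound clip could be turned into a log-magnitude spectrogram before being passed to a classifier. It is not the authors' preprocessing: the function name `log_spectrogram`, the frame length, the hop size, the 16 kHz sample rate, and the synthetic test clip are all assumptions chosen for illustration.

```python
import numpy as np


def log_spectrogram(signal, frame_len=512, hop=256):
    """Return a log-magnitude spectrogram, shape (n_frames, frame_len // 2 + 1).

    Hypothetical helper: frame_len and hop are illustrative defaults,
    not values taken from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        # Pad short clips so that at least one full frame exists.
        signal = np.pad(signal, (0, frame_len - signal.size))
    n_frames = 1 + (signal.size - frame_len) // hop
    window = np.hanning(frame_len)
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # Magnitude of the one-sided FFT of each frame, compressed with log(1 + x).
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(magnitude)


# Synthetic one-second clip (assumed 16 kHz): quiet noise, then a loud 1 kHz burst.
sample_rate = 16000
rng = np.random.default_rng(0)
clip = 0.01 * rng.standard_normal(sample_rate)
t = np.arange(sample_rate // 4) / sample_rate
clip[-t.size:] += np.sin(2 * np.pi * 1000 * t)

spec = log_spectrogram(clip)
print(spec.shape)
```

In a transfer learning setup, an array like `spec` would typically be resized or stacked into the image-like shape that the pretrained network expects; that step depends on the chosen backbone and is left out here.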
