Human-Like Gaze Behavior in Social Robots: A Deep Learning Approach Integrating Human and Non-Human Stimuli

Nonverbal behaviors, particularly gaze direction, play a crucial role in effective communication during social interactions. As social robots increasingly participate in these interactions, they must adapt their gaze to human activities and remain receptive to all cues, whether human-generated or not, to ensure seamless communication. This study aims to increase the similarity between robot and human gaze behavior across various social situations involving both human and non-human stimuli (e.g., conversations, pointing, door openings, and object drops). A key innovation of this study is the investigation of gaze responses to non-human stimuli, a critical yet underexplored area in prior research. These scenarios were simulated in the Unity software as a 3D animation and as a 360-degree real-world video. Gaze direction data were collected from 41 participants via virtual reality (VR) glasses. The preprocessed data were used to train two neural networks, an LSTM and a Transformer, to build predictive models based on individuals' gaze patterns. In the animated scenario, the LSTM and Transformer models achieved prediction accuracies of 67.6% and 70.4%, respectively; in the real-world scenario, they achieved accuracies of 72% and 71.6%, respectively. Despite differences in gaze patterns among individuals, our models outperform existing approaches in accuracy while also accounting for non-human stimuli, which previous work has not considered. Furthermore, the system was deployed on the NAO robot and evaluated by 275 participants via a comprehensive questionnaire, with results indicating high satisfaction during interactions. This work advances social robotics by enabling robots to dynamically mimic human gaze behavior in complex social contexts.
