To enhance human-robot social interaction, robots must process multiple social cues in complex real-world environments. However, incongruent input across modalities is inevitable and can be challenging for robots to process. To tackle this challenge, our study adopted the neurorobotic paradigm of crossmodal conflict resolution to make a robot express human-like social attention. For the human study, a behavioural experiment was conducted with 37 participants. To improve ecological validity, we designed a round-table meeting scenario with three animated avatars. Each avatar wore a medical mask to obscure the facial cues of the nose, mouth, and jaw. The central avatar shifted its eye gaze while the peripheral avatars generated sound. Gaze direction and sound location were either spatially congruent or incongruent. We observed that the central avatar's dynamic gaze could trigger crossmodal social attention responses. In particular, human performance was better in the congruent audio-visual condition than in the incongruent condition. For the robot study, our saliency prediction model was trained to detect social cues, predict audio-visual saliency, and attend selectively. After mounting the trained model on the iCub, we exposed the robot to laboratory conditions similar to those of the human experiment. Although human performance was superior overall, our trained model demonstrated that it could replicate attention responses similar to those of humans.
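The congruent and incongruent conditions described in the abstract can be illustrated with a minimal sketch. It assumes only two peripheral positions (left and right of the central avatar) and treats a trial as congruent when the gaze direction matches the sound location. The names `SIDES` and `build_trials` are hypothetical and are not taken from the study's materials.

```python
from itertools import product

# Assumption: the two peripheral avatars sit to the left and right of the
# central avatar, so both gaze and sound take one of these two values.
SIDES = ("left", "right")


def build_trials():
    """Enumerate every gaze-direction x sound-location pairing."""
    trials = []
    for gaze, sound in product(SIDES, SIDES):
        trials.append({
            "gaze": gaze,                # direction of the central avatar's gaze shift
            "sound": sound,              # side of the peripheral avatar producing sound
            "congruent": gaze == sound,  # spatially congruent if both point the same way
        })
    return trials


if __name__ == "__main__":
    for trial in build_trials():
        label = "congruent" if trial["congruent"] else "incongruent"
        print(f"gaze={trial['gaze']:<5} sound={trial['sound']:<5} -> {label}")
```

Under these assumptions the design yields four pairings, two congruent and two incongruent. Trial counts, timing, and response measures are not specified in the abstract and are left out here.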