In this paper, we introduce a robotic agent designed to analyze its external environment and answer participants' questions. The agent's primary purpose is to assist people through language-based interaction in video-based scenes. Our method integrates video recognition and natural language processing models within the robotic agent. We investigate the key factors affecting human-robot interaction by examining the issues that arise between participants and robotic agents. Our experimental findings reveal a positive relationship between trust and interaction efficiency. Furthermore, our model achieves a performance improvement of 2 to 3 percent over other benchmark methods.