Recent advances in multimodal Human-Robot Interaction (HRI) datasets have highlighted the fusion of speech and gesture, expanding robots' ability to pick up both explicit and implicit information from human interaction. However, existing speech-gesture HRI datasets often focus on elementary tasks, such as object pointing and pushing, which limits how well they scale to more intricate domains, and they prioritize human command data over records of robot behavior. To bridge these gaps, we introduce NatSGD, a multimodal HRI dataset that pairs natural human commands, given through speech and gestures, with synchronized demonstrations of robot behavior. NatSGD serves as a foundational resource at the intersection of machine learning and HRI research. We demonstrate its effectiveness in training robots to understand tasks from multimodal human commands, underscoring the importance of considering speech and gestures jointly. We have released our dataset, simulator, and code to facilitate future research on learning for human-robot interaction systems; these resources are available at https://www.snehesh.com/natsgd/