Spoken language serves as an accessible and efficient interface, enabling non-experts and disabled users to interact with complex assistant robots. However, accurately grounding language utterances poses a significant challenge due to the acoustic variability in speakers' voices and environmental noise. In this work, we propose a novel speech-scene graph grounding network (SGGNet$^2$) that robustly grounds spoken utterances by leveraging the acoustic similarity between correctly recognized and misrecognized words obtained from automatic speech recognition (ASR) systems. To incorporate the acoustic similarity, we extend our previous grounding model, the scene-graph-based grounding network (SGGNet), with the ASR model from NVIDIA NeMo. We accomplish this by feeding the latent vector of speech pronunciations into the BERT-based grounding network within SGGNet. We evaluate the effectiveness of using latent vectors of speech commands in grounding through qualitative and quantitative studies. We also demonstrate the capability of SGGNet$^2$ in a speech-based navigation task using a real quadruped robot, RBQ-3, from Rainbow Robotics.
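To make the intuition behind acoustic similarity concrete, the following toy sketch shows how an acoustic latent vector might rescue grounding when the ASR transcript is wrong. It is not the SGGNet$^2$ architecture: the hand-written vectors, the concatenation-based `fuse` function, and the cosine-similarity `ground` function are all assumptions made for illustration, whereas the actual model feeds the speech latent vector into a BERT-based grounding network.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def fuse(text_vec, acoustic_vec):
    """Hypothetical fusion: concatenate a text embedding with an acoustic latent vector."""
    return list(text_vec) + list(acoustic_vec)

def ground(query_vec, candidates):
    """Return the candidate object whose vector is most similar to the query."""
    return max(candidates, key=lambda name: cosine(query_vec, candidates[name]))

# Toy, hand-written vectors (assumptions, not learned embeddings).
# The speaker said "cup", but the ASR transcript reads "cop".
TEXT = {"cup": [1.0, 0.0], "map": [0.0, 1.0]}       # text embeddings of scene objects
ACOUSTIC = {"cup": [1.0, 0.1], "map": [0.1, 1.0]}   # pronunciation latents of scene objects
transcript_text = [0.4, 0.6]      # embedding of the misrecognized word "cop"
utterance_acoustic = [0.9, 0.2]   # latent of the audio, which still sounds like "cup"

# Text-only grounding follows the wrong transcript.
text_only = ground(transcript_text, TEXT)

# Fused grounding also compares how the utterance sounded.
fused_candidates = {name: fuse(TEXT[name], ACOUSTIC[name]) for name in TEXT}
fused = ground(fuse(transcript_text, utterance_acoustic), fused_candidates)

if __name__ == "__main__":
    print("text only:", text_only)
    print("text + acoustic:", fused)
```

In this toy setting the text-only query selects `map`, while the fused query selects `cup`, because the acoustic part of the vector still resembles the intended word even though the transcript does not.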