SGGNet$^2$: Speech-Scene Graph Grounding Network for Speech-guided Navigation

arXiv preprint. 7 pages, 6 figures. Accepted for the Special Session at the 2023 International Symposium on Robot and Human Interactive Communication (RO-MAN). Dohyun Kim, Yeseung Kim, Jaehwi Jang, and Minjae Song contributed equally to this work.

Spoken language serves as an accessible and efficient interface, enabling non-experts and users with disabilities to interact with complex assistant robots. However, accurately grounding language utterances poses a significant challenge due to the acoustic variability in speakers' voices and environmental noise. In this work, we propose a novel speech-scene graph grounding network (SGGNet$^2$) that robustly grounds spoken utterances by leveraging the acoustic similarity between correctly recognized and misrecognized words obtained from automatic speech recognition (ASR) systems. To incorporate the acoustic similarity, we extend our previous grounding model, the scene-graph-based grounding network (SGGNet), with the ASR model from NVIDIA NeMo. We accomplish this by feeding the latent vector of speech pronunciations into the BERT-based grounding network within SGGNet. We evaluate the effectiveness of using latent vectors of speech commands in grounding through qualitative and quantitative studies. We also demonstrate the capability of SGGNet$^2$ in a speech-based navigation task using a real quadruped robot, RBQ-3, from Rainbow Robotics.
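The intuition behind using acoustic similarity can be illustrated with a toy sketch. This is not the SGGNet$^2$ architecture: the abstract describes latent vectors from a NeMo ASR model fed into a BERT-based network, whereas the example below substitutes character-bigram counts as a crude, hypothetical proxy for a pronunciation vector. The function names (`pronunciation_vector`, `cosine`, `ground`) and the object labels are assumptions made purely for illustration. The sketch shows how a misrecognized word can still be matched to the intended scene-graph node when its representation stays close to that of the correct word.

```python
from collections import Counter
from math import sqrt


def pronunciation_vector(word):
    """Hypothetical stand-in for an ASR latent vector: character-bigram counts.

    A real system would take this vector from the ASR model instead.
    """
    padded = f"#{word.lower()}#"  # '#' marks the word boundaries
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a)  # a Counter returns 0 for missing keys
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def ground(heard_word, node_labels):
    """Return the scene-graph node label most similar to the (possibly misheard) word."""
    heard = pronunciation_vector(heard_word)
    return max(node_labels, key=lambda label: cosine(heard, pronunciation_vector(label)))


if __name__ == "__main__":
    # Assumed scene-graph labels; "cop" plays the role of an ASR error for "cup".
    print(ground("cop", ["table", "cup", "chair"]))
```

An exact string match would fail on the misrecognized word here, while a similarity-based match can still recover the intended object. The paper's approach learns this similarity from speech latent vectors rather than from spelling.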

