Human-Machine Interaction (HMI) systems have attracted considerable interest in recent years, with referring expression comprehension being one of the main challenges. Traditionally, human-machine interaction has been mostly limited to the speech and visual modalities. However, to allow for more freedom in interaction, recent works have proposed integrating additional modalities, such as gestures, into HMI systems. We consider such an HMI system with pointing gestures and construct a table-top object picking scenario inside a simulated virtual reality (VR) environment to collect data. Previous works on this task have used deep neural networks to classify the referred object, an approach that lacks transparency. In this work, we propose an interpretable and compositional model for this task based on a neuro-symbolic approach; interpretability and compositionality are crucial to building robust HMI systems for real-world applications. Finally, we evaluate the generalizability of our model to unseen environments and report the results.