Shared dynamics models are important for capturing the complexity and variability inherent in Human-Robot Interaction (HRI). Therefore, learning such shared dynamics models can enhance coordination and adaptability, enabling successful reactive interactions with a human partner. In this work, we propose a novel approach for learning a shared latent space representation for HRIs from demonstrations in a Mixture of Experts fashion, reactively generating robot actions from human observations. We train a Variational Autoencoder (VAE) to learn robot motions, regularized using an informative latent space prior that captures the multimodality of the human observations via a Mixture Density Network (MDN). We show how our formulation derives from a Gaussian Mixture Regression formulation that is typical of approaches for learning HRI from demonstrations, such as using an HMM/GMM to learn a joint distribution over the actions of the human and the robot. We further incorporate an additional regularization to prevent "mode collapse", a common phenomenon when using latent space mixture models with VAEs. We find that our approach of using an informative MDN prior from human observations for a VAE generates more accurate robot motions compared to previous HMM-based or recurrent approaches to learning shared latent representations, which we validate on various HRI datasets involving interactions such as handshakes, fistbumps, waving, and handovers. Further experiments in a real-world human-to-robot handover scenario show the efficacy of our approach in generating successful interactions with four different human interaction partners.
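To make the Gaussian Mixture Regression formulation referenced above concrete, the following is a minimal illustrative sketch (not the paper's implementation) of GMR over a joint human-robot distribution: a GMM is fit over concatenated human and robot observations, and robot actions are generated by conditioning on the human partner's observation. The toy two-component GMM, the function name `gmr`, and all variable names are hypothetical.

```python
import numpy as np

def gmr(x_h, weights, means, covs, dim_h):
    """Condition a joint GMM p(x_h, x_r) on the human observation x_h
    to compute the expected robot action E[x_r | x_h]."""
    resp = []        # per-component responsibilities p(k | x_h)
    cond_means = []  # per-component conditional means E[x_r | x_h, k]
    for pi_k, mu, sigma in zip(weights, means, covs):
        mu_h, mu_r = mu[:dim_h], mu[dim_h:]
        s_hh = sigma[:dim_h, :dim_h]       # human-human covariance block
        s_rh = sigma[dim_h:, :dim_h]       # robot-human cross-covariance
        inv_s_hh = np.linalg.inv(s_hh)
        diff = x_h - mu_h
        # marginal likelihood of x_h under component k
        norm = np.sqrt((2 * np.pi) ** dim_h * np.linalg.det(s_hh))
        lik = np.exp(-0.5 * diff @ inv_s_hh @ diff) / norm
        resp.append(pi_k * lik)
        # conditional mean of the robot dimensions given x_h
        cond_means.append(mu_r + s_rh @ inv_s_hh @ diff)
    resp = np.array(resp) / np.sum(resp)
    return sum(r * m for r, m in zip(resp, cond_means))

# Toy 1D-human / 1D-robot joint distribution with two modes.
weights = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
covs = [np.eye(2), np.eye(2)]
action = gmr(np.array([5.0]), weights, means, covs, dim_h=1)
```

Conditioning on a human observation near one mode (here, `x_h = 5.0`) assigns nearly all responsibility to the corresponding component, so the predicted robot action follows that mode; the MDN prior in the proposed approach plays the analogous role of producing such observation-dependent mixture parameters in the latent space.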