Teaching Unknown Objects by Leveraging Human Gaze and Augmented Reality in Human-Robot Interaction

Robots are becoming increasingly popular in a wide range of environments due to their work capacity, precision, efficiency, and scalability. This development has been further encouraged by advances in Artificial Intelligence, particularly Machine Learning. By employing sophisticated neural networks, robots are given the ability to detect and interact with objects in their vicinity. However, a significant drawback of these object detection models is their dependency on extensive datasets and substantial amounts of training data. This issue becomes particularly problematic when the robot's specific deployment location and surroundings are not known in advance. The vast and ever-expanding array of objects makes it virtually impossible to cover the entire spectrum of existing objects using preexisting datasets alone. The goal of this dissertation was to teach a robot unknown objects in the context of Human-Robot Interaction (HRI) in order to free it from its data dependency and from predefined scenarios. To this end, eye tracking was combined with Augmented Reality, which allowed the human teacher to communicate with the robot and point out objects by means of gaze. This approach led to the development of a multimodal HRI system that enabled the robot to identify and visually segment the Objects of Interest in 3D space. Through the class information provided by the human, the robot was able to learn the objects and redetect them at a later stage. With the knowledge gained from this HRI-based teaching, the robot's object detection capabilities exhibited performance comparable to state-of-the-art object detectors trained on extensive datasets, without being restricted to predefined classes, which demonstrates its versatility and adaptability.
