Language-enabled robots have been widely studied in recent years to enable natural human-robot interaction and teaming in a variety of real-world applications. Such robots must be able to comprehend referring expressions, that is, to identify a particular object from visual perception using a set of referring attributes extracted from natural language. However, visual observations of an object may not be available when it is referred to, and the number of objects and attributes may be unbounded in open worlds. To address these challenges, we implement an attribute-based compositional zero-shot learning method that uses a list of attributes to perform referring expression comprehension in open worlds. We evaluate the approach on two datasets, MIT-States and Clothing 16K. Preliminary experimental results show that the implemented approach allows a robot to correctly identify the objects referred to by human commands.
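As a rough illustration of what identifying an object from a list of referring attributes can look like, the following is a minimal sketch and not the method implemented in this work. It assumes that some upstream model (not shown) has already produced a per-attribute score for each candidate object; the function name `select_referent`, the score dictionaries, and the mean-score rule are all hypothetical choices made for this example.

```python
from typing import Dict, List


def select_referent(
    candidates: Dict[str, Dict[str, float]],
    attributes: List[str],
) -> str:
    """Return the id of the candidate that best matches the referring attributes.

    `candidates` maps a candidate id to hypothetical attribute scores in [0, 1];
    `attributes` is the list of attributes extracted from the command.
    Attributes a candidate has no score for count as 0.0.
    """
    if not candidates:
        raise ValueError("at least one candidate object is required")
    if not attributes:
        raise ValueError("at least one referring attribute is required")

    def mean_score(attr_scores: Dict[str, float]) -> float:
        # Average the scores of the requested attributes (assumed scoring rule).
        return sum(attr_scores.get(a, 0.0) for a in attributes) / len(attributes)

    # Pick the candidate with the highest mean attribute score.
    return max(candidates, key=lambda cid: mean_score(candidates[cid]))


# Toy data: scores are made up for illustration only.
example_candidates = {
    "obj_a": {"red": 0.9, "shirt": 0.8},
    "obj_b": {"blue": 0.9, "shirt": 0.7},
}
chosen = select_referent(example_candidates, ["red", "shirt"])
```

Because the selection depends only on attribute scores, an attribute combination never seen together during training can still be ranked, which is the intuition behind composing attributes in a zero-shot setting.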